On the error exponent of variable-length block-coding schemes 
over finite-state Markov channels with feedback * 



Giacomo Como, Serdar Yüksel, and Sekhar Tatikonda






Abstract 



The error exponent of Markov channels with feedback is studied in the variable-length
block-coding setting. Burnashev's [5] classic result is extended and a single-letter
characterization for the reliability function of finite-state Markov channels is presented,
under the assumption that the channel state is causally observed both at the transmitter
and at the receiver side. Tools from stochastic control theory are used in order to treat
channels with intersymbol interference. In particular, the convex analytical approach to
Markov decision processes [4] is adopted to handle problems with stopping time horizons
arising from variable-length coding schemes.

1 Introduction

The role of feedback in channel coding is a long-studied problem in information theory.
In 1956 Shannon [24] proved that noiseless causal output feedback does not increase the
capacity of a discrete memoryless channel (DMC). Feedback, though, can help in improving
the trade-off between reliability and delay of DMCs at rates below capacity. This trade-off
is traditionally measured in terms of the error exponent; in fact, since Shannon's work, much
research has focused on studying error exponents of channels with feedback. Burnashev [5]
found a simple exact formula for the reliability function (i.e. the highest achievable error
exponent) of a DMC with perfect causal output feedback in the variable-length block-coding
setting. The present paper deals with a generalization of Burnashev's result to a certain class
of channels with memory. Specifically, we shall prove a simple single-letter characterization
of the reliability function of finite-state Markov channels (FSMCs), in the general case when
intersymbol interference (ISI) is present. Under mild ergodicity assumptions, we will prove
that, when one is allowed variable-length block-coding with perfect causal output feedback
and causal state knowledge both at the transmitter and at the receiver end, the reliability
function has the form

$$E_B(R) = D\left(1 - \frac{R}{C}\right), \qquad R \in (0, C). \qquad (1)$$

In (1), $R$ is the transmission rate, measured with respect to the average transmission time.
The capacity $C$ and the coefficient $D$ are quantities which will be defined as solutions of
finite-dimensional optimization problems involving the stochastic kernel describing the FSMC. The
former will turn out to equal the maximum, over all choices of the channel input distributions
as a function of the channel state, of the conditional mutual information between channel
input and the pair of channel output and next channel state given the current state, whose
marginal distribution coincides with the induced ergodic state measure (see (6)). The latter
will instead equal the average, with respect to the induced ergodic state measure, of the
Kullback-Leibler information divergence between the joint channel output and next channel
state distributions associated to the pair of most distinguishable choices of a channel input
symbol as a function of the current state (see (12)).
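As a small numerical illustration of formula (1), the following sketch evaluates the exponent for given values of $C$ and $D$. The binary symmetric channel helper is an assumption added here purely for illustration (it corresponds to the memoryless case with a trivial state space); the function names are hypothetical and natural logarithms are used.

```python
import math

def burnashev_exponent(rate, capacity, d_coeff):
    """Evaluate E_B(R) = D * (1 - R / C) for 0 < R < C, as in equation (1)."""
    if not 0.0 < rate < capacity:
        raise ValueError("rate must lie strictly between 0 and the capacity")
    return d_coeff * (1.0 - rate / capacity)

def bsc_capacity_and_coefficient(p):
    """Capacity C and coefficient D (in nats) of a BSC with crossover 0 < p < 1/2.

    Illustrative memoryless example: D is the divergence between the two
    output distributions, D(P(.|0) || P(.|1)) = (1 - 2p) log((1 - p) / p).
    """
    h = -p * math.log(p) - (1 - p) * math.log(1 - p)  # binary entropy in nats
    capacity = math.log(2) - h
    d_coeff = (1 - 2 * p) * math.log((1 - p) / p)
    return capacity, d_coeff

if __name__ == "__main__":
    C, D = bsc_capacity_and_coefficient(0.1)
    for fraction in (0.25, 0.5, 0.75):
        print(fraction, burnashev_exponent(fraction * C, C, D))
```

The exponent is linear in $R$ and vanishes at capacity with slope $-D/C$, which is the nonzero slope referred to in the discussion of Burnashev's result.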

The problem of characterizing error exponents of memoryless channels with feedback has 
been addressed in the information theory literature in a variety of different frameworks. Par- 
ticularly relevant are the choice of block versus continuous transmission, the possibility of 
allowing variable-length coding schemes, and the way delay is measured. In fact, much more 
than in the non-feedback case, these choices lead to very different results for the error expo- 
nent, albeit not altering the capacity value. In continuous transmission systems information 
bits are introduced at the encoder, and later decoded, individually. Continuous transmission
with feedback was considered by Horstein [15], who was probably the first to show
that variable-length coding schemes can give larger error exponents than fixed-length ones.
Recently, continuous transmission with fixed delay has attracted renewed attention in the
context of anytime capacity [23]. In this paper, however, we shall restrict ourselves to block
transmission, which is the framework considered by the largest part of the previous literature. 

In block transmission systems the information sequence is partitioned into blocks of fixed 
length which are then encoded into channel input sequences. When there is no feedback 
these sequences need to be of a predetermined, fixed length in order to guarantee that trans- 
mitter and receiver remain synchronized. When there is feedback, instead, the availability 
of common information shared between transmitter and receiver makes it possible to use 
variable-length schemes. Here the transmission time is allowed to dynamically depend on the 
channel output sequence. It is known that exploiting the possibility of using variable-length 
block-coding schemes guarantees high gains in terms of error exponent. In fact, Dobrushin
showed that the sphere-packing bound still holds for fixed-length block-coding schemes
over symmetric DMCs even when perfect output feedback is causally available to the encoder
(a generalization to nonsymmetric DMCs was addressed in [H]). Even though fixed-length
block-coding schemes with feedback have been studied (see [Ml ID]), the above-mentioned
results pose severe constraints on the performance such schemes can achieve. Moreover, no
closed form for the reliability function at all rates is known for fixed-length block coding with
feedback, except for the very special class of symmetric DMCs with positive zero-error capacity
(cf. [7, p. 199]). It is worth mentioning that the situation can be much different for
continuous-alphabet channels. For the additive white Gaussian noise channel (AWGNC) with
an average power constraint, Schalkwijk and Kailath [26] proved that a doubly exponential error
rate is achievable by fixed-length block codes. However, when a peak power constraint on the
input of an AWGNC is added, this phenomenon disappears, as shown in ^2\-. At the same
time it has also been well known that, if variable-length coding schemes are allowed, then the
sphere-packing exponent can be beaten even when no output feedback is available except for a
single una tantum bit guaranteeing synchronization between transmitter and receiver. This
situation is traditionally referred to as decision feedback and was studied in [12] (see also
[7, p. 201]).

A very simple exact formula was found by Burnashev [5] for the reliability function of
DMCs with full causal output feedback in the case of variable-length block-coding schemes.
Burnashev's analysis combined martingale theory arguments with more standard
information-theoretic tools. It is remarkable that in this setting the reliability function is known, in a very
simple form, at any rate below capacity, in sharp contrast to what happens in most channel
coding problems, for which the reliability function can be exactly evaluated only at rates close
to capacity. Another important point is that the Burnashev exponent of a generic DMC can
dramatically exceed the sphere-packing exponent; in particular, it approaches capacity with
nonzero slope.

Thus, variable-length block coding appears to be a natural setting for transmission over
channels with feedback. In fact, it has already been considered by many authors after
Burnashev's landmark work. A simple two-phase iterative scheme achieving the Burnashev exponent
was introduced by Yamamoto and Itoh in [33]. More recently, low-complexity variable-length
block-coding schemes with feedback have been proposed and analyzed in [21]. The works [28]
and [29] dealt with universality issues, addressing the question of whether the Burnashev exponent
can be achieved without exact knowledge of the statistics of the channel, knowing only that
it belongs to a certain class of DMCs. In [2] a simplification of Burnashev's original proof
[5] is proposed, while [17] is concerned with the characterization of the reliability function of
DMCs with feedback and cost constraints. In [22] low-complexity schemes for FSMCs with
feedback are proposed. However, to the best of our knowledge, no extension of Burnashev's
theorem to channels with memory has been considered.

The present work deals with a generalization of Burnashev's result to FSMCs. Channels
with memory, and FSMCs in particular, model transmission problems where
fading is an important component, as for instance in wireless communication. Information
theoretical limits of FSMCs both with and without feedback have been widely studied in the
literature: we refer to the classic textbooks [16], [3T] and references therein for an overview of
the available literature (see also [13]). It is known that the capacity is strongly affected by
the hypothesis about the nature of the channel state information (CSI) available both at the
transmitter and at the receiver side. In particular, while output feedback does not increase
the capacity when the state is causally observable both at the transmitter and at the receiver
side (see [27] for a proof, first noted in [24]), it generally does so for different information
patterns. In particular, when the channel state is not observable at the transmitter, it is known
that feedback may help improve capacity by allowing the encoder to estimate the channel
state [27]. However, in this paper only the case when the channel state is causally observed
both at the transmitter and at the receiver end will be considered. Our choice is justified by
the aim to separate the study of the role of output feedback in channel state estimation from
its effect in allowing better reliability versus delay tradeoffs for variable-length block-coding
schemes.

In [27] a general stochastic control framework for evaluating the capacity of channels
with memory and feedback has been introduced. The capacity has been characterized as the
solution of a dynamic programming average cost optimality equation. Existence of a solution
to such an equation implies information stability. Also, lower bounds à la Gallager to the error
exponents achievable with fixed-length coding schemes are obtained in [27]. In the present
paper we follow a similar approach in order to characterize the reliability function of
variable-length block-coding schemes with feedback. Such an exponent will be characterized in terms
of solutions to certain Markov decision problems. The main new feature posed by
variable-length schemes is that we have to deal with average cost optimality problems with a stopping
time horizon, for which standard results in Markov decision theory cannot be used directly.
We adopt the convex analytical approach of [1] and use the Hoeffding-Azuma inequality in order
to prove a strong uniform convergence result for the empirical measure process. This allows
us to find sufficient conditions on the tails of a sequence of stopping times for the solutions
of the corresponding average cost optimality problems to be asymptotically approximated by
the solution of the corresponding infinite horizon problem, for which stationary policies are
known to be optimal.

The rest of this paper is organized as follows. In Section 2 causal feedback variable-length
block-coding schemes for FSMCs are introduced, and capacity and reliability function are
defined as solutions of optimization problems involving the stochastic kernel describing the
FSMC. The main result of the paper is then stated in Theorem 1. In Section 3 we prove an
upper bound to the best error exponent achievable by variable-length block-coding schemes
with perfect feedback over FSMCs. The main result of that section is contained in Theorem
7, which generalizes Burnashev's result. Section 4 is of a technical nature and deals with
Markov decision processes with stopping time horizons. Some stochastic control techniques
are reviewed and the main result is contained in Theorem 11, which is then used to prove that
the bound of Theorem 7 asymptotically coincides with the reliability function (1). In Section
5 a family of simple iterative schemes based on a generalization of Yamamoto-Itoh's [33] is
proposed and its performance is analyzed, showing that this family is asymptotically optimal
in terms of error exponent. Finally, in Section 6 an explicit example is studied. Section 7
presents some conclusions and points out possible topics for future research.

2 Statement of the problem and main result 

2.1 Stationary ergodic Markov channels 

Throughout the paper $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{S}$ will respectively denote channel input, output and state
spaces. All are assumed to be finite.

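Definition 1 A stationary Markov channel is described by:

• a stochastic kernel consisting of a family $\{P(\,\cdot\,,\,\cdot\,|\,s,x) \in \mathcal{P}(\mathcal{S} \times \mathcal{Y}) \mid s \in \mathcal{S},\ x \in \mathcal{X}\}$ of
probability measures over $\mathcal{S} \times \mathcal{Y}$, indexed by elements of $\mathcal{S}$ and $\mathcal{X}$;

• an initial state distribution $\mu_1$ in $\mathcal{P}(\mathcal{S})$.

For a channel as in Definition 1, let

$$P_S(s_+ | s,x) := \sum_{y \in \mathcal{Y}} P(s_+, y | s, x)$$

be the $\mathcal{S}$-marginals. We shall say that a Markov channel as above has no ISI when the
$\mathcal{S}$-marginals do not depend on the chosen channel input, i.e.

$$P_S(s_+ | s, x_1) = P_S(s_+ | s, x_2), \qquad \forall\, s, s_+ \in \mathcal{S},\ x_1, x_2 \in \mathcal{X}. \qquad (2)$$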

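We will consider the associated stochastic kernels

$$\{Q(\,\cdot\,,\,\cdot\,|\,s,u) \in \mathcal{P}(\mathcal{S} \times \mathcal{Y}) \mid s \in \mathcal{S},\ u \in \mathcal{P}(\mathcal{X})\}, \qquad \{Q_S(\,\cdot\,|\,s,u) \in \mathcal{P}(\mathcal{S}) \mid s \in \mathcal{S},\ u \in \mathcal{P}(\mathcal{X})\},$$

where for every channel input distribution $u$ in $\mathcal{P}(\mathcal{X})$

$$Q(s_+, y | s, u) := \sum_{x \in \mathcal{X}} P(s_+, y | s, x)\, u(x), \qquad Q_S(s_+ | s, u) := \sum_{x \in \mathcal{X}} P_S(s_+ | s, x)\, u(x). \qquad (3)$$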
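The objects in (2) and (3) are straightforward to compute for a finite kernel. The sketch below is only illustrative: it assumes the kernel is stored as a nested list `P[s][x][s_next][y]`, and the helper `toy_kernel` builds a hypothetical two-state binary channel that is not taken from the paper.

```python
def state_marginal(P, s, x):
    """P_S(. | s, x): sum the joint kernel P[s][x][s_next][y] over the outputs y."""
    return [sum(row) for row in P[s][x]]

def has_no_isi(P, tol=1e-12):
    """Check condition (2): the state marginals do not depend on the input x."""
    for s in range(len(P)):
        reference = state_marginal(P, s, 0)
        for x in range(1, len(P[s])):
            current = state_marginal(P, s, x)
            if any(abs(a - b) > tol for a, b in zip(reference, current)):
                return False
    return True

def mixed_kernel(P, s, u):
    """Q(s_next, y | s, u) = sum_x P(s_next, y | s, x) u(x), as in (3)."""
    n_next, n_out = len(P[s][0]), len(P[s][0][0])
    return [[sum(u[x] * P[s][x][sp][y] for x in range(len(u)))
             for y in range(n_out)] for sp in range(n_next)]

def toy_kernel(next_state_given_x, crossover):
    """Hypothetical FSMC with binary input and output.

    In state s the input crosses a BSC with crossover[s]; independently, the
    next state is drawn from next_state_given_x[x], so the input x may
    influence the state evolution (ISI).
    """
    P = []
    for eps in crossover:
        per_state = []
        for x, nxt in enumerate(next_state_given_x):
            out = [1 - eps, eps] if x == 0 else [eps, 1 - eps]
            per_state.append([[a * b for b in out] for a in nxt])
        P.append(per_state)
    return P

if __name__ == "__main__":
    with_isi = toy_kernel([[0.9, 0.1], [0.2, 0.8]], [0.1, 0.3])
    print(has_no_isi(with_isi))
    print(mixed_kernel(with_isi, 0, [0.5, 0.5]))
```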

Given $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{X})$ (we shall refer to such a map as a deterministic stationary policy),
denote by

$$Q_\pi := \big(Q_S(s_+ | s, \pi(s))\big)_{s, s_+ \in \mathcal{S}} \qquad (4)$$

the state transition stochastic matrix induced by $\pi$. With an abuse of notation, for any
map $f : \mathcal{S} \to \mathcal{X}$ we shall write $Q_f$ in place of $Q_{\delta_{f(\cdot)}}$. Throughout the paper we will restrict
ourselves to FSMCs satisfying the following ergodicity assumption.

Assumption 2 For every $f : \mathcal{S} \to \mathcal{X}$ the stochastic matrix $Q_f$ is irreducible.

Assumption 2 can be relaxed or replaced by other equivalent assumptions. Here we limit
ourselves to observing that it involves the $\mathcal{S}$-marginals $\{P_S\}$ of the Markov channel only.
Moreover, it is a purely discrete condition, since it requires a finite number of finite directed graphs
to be strongly connected. Since taking a convex combination does not reduce the support,
Assumption 2 guarantees that for every deterministic stationary policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{X})$ the
stochastic matrix $Q_\pi$ is irreducible. Then, the Perron-Frobenius theorem guarantees that $Q_\pi$ has
a unique invariant measure in $\mathcal{P}(\mathcal{S})$, which will be denoted by $\mu_\pi$. Notice that in the non-ISI
case Assumption 2 is tantamount to requiring the strict positivity of the $\mathcal{S}$-marginals of the
stochastic kernel.
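Because Assumption 2 only asks that finitely many directed graphs be strongly connected, it can be checked by enumeration. The following sketch assumes the kernel layout `P[s][x][s_next][y]` (a hypothetical storage convention) and approximates the invariant measure by iterating the lazy chain $(I + Q)/2$, a standard device that shares the invariant measure of $Q$ and avoids periodicity issues; it is one possible approach, not the paper's.

```python
import itertools

def induced_matrix(P, f):
    """State transition matrix Q_f: Q_f[s][sp] = sum_y P[s][f[s]][sp][y]."""
    n = len(P)
    return [[sum(P[s][f[s]][sp]) for sp in range(n)] for s in range(n)]

def is_irreducible(Q):
    """True iff the graph with an edge i -> j whenever Q[i][j] > 0 is strongly connected."""
    n = len(Q)

    def reach(start, forward):
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                weight = Q[i][j] if forward else Q[j][i]
                if weight > 0 and j not in seen:
                    seen.add(j)
                    stack.append(j)
        return seen

    # Strongly connected iff node 0 reaches every node and every node reaches node 0.
    return len(reach(0, True)) == n and len(reach(0, False)) == n

def satisfies_assumption_2(P, n_inputs):
    """Check irreducibility of Q_f for each of the finitely many maps f: S -> X."""
    maps = itertools.product(range(n_inputs), repeat=len(P))
    return all(is_irreducible(induced_matrix(P, f)) for f in maps)

def invariant_measure(Q, iters=2000):
    """Approximate the invariant measure of an irreducible Q via the lazy chain (I + Q) / 2."""
    n = len(Q)
    mu = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(mu[i] * Q[i][j] for i in range(n)) for j in range(n)]
        mu = [0.5 * (a + b) for a, b in zip(mu, step)]
    return mu
```

For instance, `invariant_measure([[0.9, 0.1], [0.2, 0.8]])` approaches the vector (2/3, 1/3), which solves the balance equation for that two-state chain.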



2.2 Capacity of ergodic FSMCs 

To any ergodic FSMC we associate the mutual information cost function $c : \mathcal{S} \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$,

$$c(s,u) := \sum_{x \in \mathcal{X}} \sum_{s_+ \in \mathcal{S}} \sum_{y \in \mathcal{Y}} u(x)\, P(s_+, y | s, x) \log \frac{P(s_+, y | s, x)}{Q(s_+, y | s, u)}, \qquad (5)$$

and define its capacity as

$$C := \max_{\pi : \mathcal{S} \to \mathcal{P}(\mathcal{X})} \sum_{s \in \mathcal{S}} \mu_\pi(s)\, c(s, \pi(s)) = \max_{\pi : \mathcal{S} \to \mathcal{P}(\mathcal{X})} I(X; Y, S_+ | S). \qquad (6)$$

In the rightmost side of (6) the term $I(X; S_+, Y | S)$ denotes the conditional mutual
information [B] between $X$ and the pair $(S_+, Y)$ given $S$, where $S$ is an $\mathcal{S}$-valued r.v. whose marginal
distribution is given by the invariant measure $\mu_\pi$, $X$ is an $\mathcal{X}$-valued r.v. whose conditional
distribution given $S$ is described by the policy $\pi$, while $S_+$ and $Y$ are respectively an $\mathcal{S}$-valued
r.v. and a $\mathcal{Y}$-valued r.v. whose joint conditional distribution given $X$ and $S$ is described by
the stochastic kernel $P(S_+, Y | S, X)$. Notice that in particular the mutual information cost
function $c$ is continuous over $\mathcal{S} \times \mathcal{P}(\mathcal{X})$ and takes values in the bounded interval $[0, \log |\mathcal{X}|]$.
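The cost function $c(s,u)$ of (5) can be evaluated directly from the kernel. The sketch below assumes the nested-list layout `P[s][x][s_next][y]` (an illustrative convention, not the paper's) and uses natural logarithms; terms with zero probability are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math

def mi_cost(P, s, u):
    """Mutual information cost c(s, u) of equation (5), in nats.

    P[s][x][s_next][y] is the joint kernel and u is an input distribution
    given as a list indexed by x.
    """
    n_next, n_out = len(P[s][0]), len(P[s][0][0])
    total = 0.0
    for sp in range(n_next):
        for y in range(n_out):
            # Q(s_next, y | s, u): the kernel averaged over the input distribution.
            q = sum(u[x] * P[s][x][sp][y] for x in range(len(u)))
            for x in range(len(u)):
                p = P[s][x][sp][y]
                if u[x] > 0 and p > 0:
                    total += u[x] * p * math.log(p / q)
    return total

if __name__ == "__main__":
    # Trivial state space with a BSC(0.1): the uniform input gives log 2 - h(0.1).
    bsc = [[[[0.9, 0.1]], [[0.1, 0.9]]]]
    print(mi_cost(bsc, 0, [0.5, 0.5]))
```

Averaging `mi_cost(P, s, pi[s])` against the invariant measure $\mu_\pi$ gives the objective maximized in (6).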

The quantity $C$ defined above is known to equal the capacity of the ergodic Markov
channel we are considering when perfect causal CSI is available at both transmission ends,
with or without output feedback [27]. It is important to observe that, due to the presence of
ISI in the channel model we are considering, the policy $\pi$ plays a dual role in the optimization
problem in (6), since it affects both the mutual information cost $c(s, \pi(s)) = I(X; S_+, Y | S = s)$
and the ergodic channel state distribution $\mu_\pi$ with respect to which the former is averaged.

In the case when there is no ISI, i.e. when (2) is satisfied, this phenomenon disappears.
In fact, since the invariant measure $\mu$ is independent of the policy $\pi$, we have that (6) reduces
to

$$C = \sum_{s \in \mathcal{S}} \mu(s) \max_{p_X \in \mathcal{P}(\mathcal{X})} c(s, p_X) = \sum_{s \in \mathcal{S}} \mu(s) \max_{p_X \in \mathcal{P}(\mathcal{X})} I(X; Y | S = s), \qquad (7)$$

where in the rightmost side of (7) the quantity $\max_{p_X \in \mathcal{P}(\mathcal{X})} I(X; Y | S = s)$ coincides with the
capacity of the DMC associated to the state $s$. The simplest case of FSMCs with no ISI is
obtained when the state sequence forms an i.i.d. process independent from the channel input
with distribution $\mu$, i.e. when

$$P_S(s_+ | s, x) = \mu(s_+), \qquad \forall\, s, s_+ \in \mathcal{S},\ x \in \mathcal{X}.$$
In this case, it is not difficult to check that (6) reduces to the capacity of a DMC with input
space $\mathcal{X}' := \mathcal{X}^{\mathcal{S}}$ (the set of all maps from $\mathcal{S}$ to $\mathcal{X}$), output space $\mathcal{Y}' := \mathcal{S} \times \mathcal{Y}$ (the Cartesian
product of $\mathcal{S}$ and $\mathcal{Y}$), and transition probabilities given by

$$P'(y' | x') := \sum_{s \in \mathcal{S}} \mu(s)\, P(y' | s, x'(s)), \qquad x' : \mathcal{S} \to \mathcal{X},\ y' \in \mathcal{S} \times \mathcal{Y}. \qquad (8)$$

Observe the difference with respect to the case when the state is causally observed at the
transmitter only, whose capacity was first found in [25]. While the input space of the
equivalent DMC is the same in both cases, its output space is larger in the case we are dealing
with in this paper with respect to that addressed by Shannon, since we are assuming that
the state is causally observable also at the receiver end.

Finally, notice that, when the state space is trivial (i.e. when $|\mathcal{S}| = 1$), (6) reduces to the
usual definition of the capacity of a DMC.
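In the no-ISI case, formula (7) says that the capacity is the state-average of per-state DMC capacities. A minimal sketch of that computation follows; it uses the Blahut-Arimoto iteration, a standard algorithm that is not part of the paper, and assumes each state's DMC is supplied as a row-stochastic matrix `W[x][y]` (a hypothetical input format). Results are in nats.

```python
import math

def blahut_arimoto(W, iters=500):
    """Capacity (nats) of a DMC with W[x][y] = P(y | x), via Blahut-Arimoto iterations."""
    n_in, n_out = len(W), len(W[0])
    r = [1.0 / n_in] * n_in  # current input distribution, kept strictly positive
    z = 1.0
    for _ in range(iters):
        # Output distribution induced by the current input distribution.
        q = [sum(r[x] * W[x][y] for x in range(n_in)) for y in range(n_out)]
        # c[x] = exp of the divergence between the row W[x] and q.
        c = [math.exp(sum(W[x][y] * math.log(W[x][y] / q[y])
                          for y in range(n_out) if W[x][y] > 0))
             for x in range(n_in)]
        z = sum(r[x] * c[x] for x in range(n_in))
        r = [r[x] * c[x] / z for x in range(n_in)]
    # log z is a lower bound on the capacity that converges to it.
    return math.log(z)

def no_isi_capacity(mu, channels):
    """Right-hand side of (7): average over states of the per-state DMC capacities."""
    return sum(m * blahut_arimoto(W) for m, W in zip(mu, channels))

if __name__ == "__main__":
    bsc = lambda p: [[1 - p, p], [p, 1 - p]]
    print(no_isi_capacity([0.25, 0.75], [bsc(0.1), bsc(0.3)]))
```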



2.3 Burnashev coefficient of FSMCs 

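Consider now the cost function $d : \mathcal{S} \times \mathcal{P}(\mathcal{X}) \to [0, +\infty]$,

$$d(s,u) := \sup_{v \in \mathcal{P}(\mathcal{X})} \sum_{s_+ \in \mathcal{S}} \sum_{y \in \mathcal{Y}} Q(s_+, y | s, u) \log \frac{Q(s_+, y | s, u)}{Q(s_+, y | s, v)}. \qquad (9)$$

Notice that the term to be optimized in the righthand side of (9) equals the Kullback-Leibler
information divergence between the probability measures $Q(\,\cdot\,,\,\cdot\,|\,s,u)$ and $Q(\,\cdot\,,\,\cdot\,|\,s,v)$ in
$\mathcal{P}(\mathcal{S} \times \mathcal{Y})$. It follows that, if we introduce the quantities

$$\lambda := \min\{\lambda_s \mid s \in \mathcal{S}\}, \qquad \lambda_s := \min\Big\{ \min_{x \in \mathcal{X}} P(s_+, y | s, x) \ \Big|\ s_+, y : \exists\, z : P(s_+, y | s, z) > 0 \Big\}, \qquad (10)$$

we have that the cost function $d$ is bounded and continuous over $\mathcal{S} \times \mathcal{P}(\mathcal{X})$ if and only if $\lambda$
is strictly positive, i.e.

$$\lambda > 0 \iff d_{\max} := \sup_{s \in \mathcal{S},\, u \in \mathcal{P}(\mathcal{X})} d(s,u) < +\infty. \qquad (11)$$

Define the Burnashev coefficient of a Markov channel as

$$D := \sup_{\pi : \mathcal{S} \to \mathcal{P}(\mathcal{X})} \sum_{s \in \mathcal{S}} \mu_\pi(s)\, d(s, \pi(s)). \qquad (12)$$

Notice that $D$ is finite iff (11) holds. Moreover, a standard convexity argument allows one to
conclude that both the suprema in (9) and in (12) are achieved at some corner points, so that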


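$$D = \max_{f_0, f_1 : \mathcal{S} \to \mathcal{X}} \sum_{s \in \mathcal{S}} \mu_{f_0}(s) \sum_{s_+ \in \mathcal{S}} \sum_{y \in \mathcal{Y}} P(s_+, y | s, f_0(s)) \log \frac{P(s_+, y | s, f_0(s))}{P(s_+, y | s, f_1(s))}$$

$$\phantom{D} = \max_{f_0, f_1 : \mathcal{S} \to \mathcal{X}} \sum_{s \in \mathcal{S}} \mu_{f_0}(s)\, D\big(P(\,\cdot\,,\,\cdot\,|\,s, f_0(s)) \,\|\, P(\,\cdot\,,\,\cdot\,|\,s, f_1(s))\big). \qquad (13)$$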
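Since (13) is a maximum over finitely many pairs of maps, the coefficient $D$ can be found by brute-force enumeration for small alphabets. The sketch below assumes the layout `P[s][x][s_next][y]` and that Assumption 2 holds, so that each induced chain has a unique invariant measure; the lazy-chain iteration used to approximate that measure is a convenience chosen here, not the paper's method.

```python
import itertools
import math

def kl(p, q):
    """Kullback-Leibler divergence (nats) between joint pmfs stored as nested lists."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        for a, b in zip(p_row, q_row):
            if a > 0:
                if b == 0:
                    return math.inf
                total += a * math.log(a / b)
    return total

def invariant_measure(Q, iters=2000):
    """Approximate the invariant measure of an irreducible Q via the lazy chain (I + Q) / 2."""
    n = len(Q)
    mu = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(mu[i] * Q[i][j] for i in range(n)) for j in range(n)]
        mu = [0.5 * (a + b) for a, b in zip(mu, step)]
    return mu

def burnashev_coefficient(P, n_inputs):
    """Evaluate (13) by enumerating all pairs of maps f0, f1 from states to inputs."""
    n_states = len(P)
    maps = list(itertools.product(range(n_inputs), repeat=n_states))
    best = 0.0
    for f0 in maps:
        # The ergodic measure depends on f0 only, which is the asymmetry noted in the text.
        Q = [[sum(P[s][f0[s]][sp]) for sp in range(n_states)] for s in range(n_states)]
        mu = invariant_measure(Q)
        for f1 in maps:
            value = sum(mu[s] * kl(P[s][f0[s]], P[s][f1[s]]) for s in range(n_states))
            best = max(best, value)
    return best
```

The number of pairs grows as $|\mathcal{X}|^{2|\mathcal{S}|}$, so this enumeration is only practical for small state and input spaces.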



Similarly to what already noted for the role of the policy $\pi$ in the optimization problem (6),
it can be observed that, due to the presence of ISI, the map $f_0$ has a dual effect in the
maximization in (13), since it affects both the Kullback-Leibler information divergence cost
$D\big(P(\,\cdot\,,\,\cdot\,|\,s, f_0(s)) \,\|\, P(\,\cdot\,,\,\cdot\,|\,s, f_1(s))\big)$ and the ergodic state measure $\mu_{f_0}$. Notice the
asymmetry with the role of the map $f_1$, whose associated ergodic measure instead does not come
into the picture at all in the definition of the coefficient $D$. Once again, in the absence of ISI,
(13) simplifies to

$$D = \sum_{s \in \mathcal{S}} \mu(s) \max_{x_0, x_1 \in \mathcal{X}} D\big(P(\,\cdot\,,\,\cdot\,|\,s, x_0) \,\|\, P(\,\cdot\,,\,\cdot\,|\,s, x_1)\big).$$

Figure 1: Information patterns for variable-length block-coding schemes on an FSMC with
causal feedback and CSI.

We observe that in the memoryless case (which can be recovered when $|\mathcal{S}| = 1$) the
coefficient $D$ coincides with the Kullback-Leibler information divergence between the output
measures associated to the pair of most distinguishable inputs, the quantity originally denoted
with the symbol $C_1$ in [5]. When the state space is nontrivial ($|\mathcal{S}| > 1$), and the channel state
process forms an i.i.d. sequence independent from the channel input, then the Burnashev
coefficient $D$ reduces to that of the equivalent DMC with enlarged input space $\mathcal{X}' = \mathcal{X}^{\mathcal{S}}$ and
output space $\mathcal{Y}' = \mathcal{S} \times \mathcal{Y}$, with transition probabilities defined in (8).

2.4 Causal feedback encoders, sequential decoders, and main result 

Definition 3 A causal feedback encoder is the pair of a finite message set $\mathcal{W}$ and a sequence
of maps

$$\Phi = \big(\mathcal{W}, \{\phi_t : \mathcal{W} \times \mathcal{S}^t \times \mathcal{Y}^{t-1} \to \mathcal{X}\}_{t \in \mathbb{N}}\big). \qquad (14)$$

With Definition 3 we are implicitly assuming that perfect state knowledge as well as perfect output
feedback are available at the encoder side.

Given a stationary Markov channel and a causal feedback encoder as in Definition 3, we will
consider a probability space $(\Omega, \mathcal{A}, \mathbb{P}_\Phi)$ ($\mathbb{E}_\Phi$ will denote the corresponding expectation operator)
over which are defined:

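• a $\mathcal{W}$-valued random variable $W$ describing the message to be transmitted;

• a sequence $X = (X_t)_{t \in \mathbb{N}}$ of $\mathcal{X}$-valued r.v.s (the channel input sequence);

• a sequence $Y = (Y_t)_{t \in \mathbb{N}}$ of $\mathcal{Y}$-valued r.v.s (the channel output sequence);

• a sequence $S = (S_t)_{t \in \mathbb{N}}$ of $\mathcal{S}$-valued r.v.s (the state sequence).

We shall consider the time ordering

$$W, S_1, X_1, Y_1, S_2, X_2, Y_2, \ldots,$$

and assume that

$$\mathbb{P}_\Phi(W = w) = \frac{1}{|\mathcal{W}|}, \qquad \mathbb{P}_\Phi(S_1 = s | W) = \mu_1(s),$$

$$\mathbb{P}_\Phi\big(X_t = x \,\big|\, W, S_1^t, X_1^{t-1}, Y_1^{t-1}\big) = \mathbb{1}_{\{\phi_t(W, S_1^t, Y_1^{t-1})\}}(x), \qquad \mathbb{P}_\Phi\text{-a.s.},$$

$$\mathbb{P}_\Phi\big(S_{t+1} = s, Y_t = y \,\big|\, W, S_1^t, Y_1^{t-1}, X_1^t\big) = P(s, y | S_t, X_t), \qquad \mathbb{P}_\Phi\text{-a.s.}$$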

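It is convenient to introduce the following notation for the information patterns available
at the encoder and decoder side. For every $t$ we define the sigma-fields $\mathcal{E}_t := \sigma(S_1^t, Y_1^{t-1})$,
describing the feedback information available at the encoder side, and $\mathcal{F}_t := \sigma(S_1^t, Y_1^t)$,
describing the information available at the decoder. Clearly

$$\{\emptyset, \Omega\} = \mathcal{E}_0 = \mathcal{F}_0 \subseteq \mathcal{E}_1 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{A}. \qquad (15)$$

In particular we end up with two nested filtrations: $\mathcal{F} := (\mathcal{F}_t)_{t \in \mathbb{Z}_+}$ and $\mathcal{E} := (\mathcal{E}_t)_{t \in \mathbb{Z}_+}$.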

Definition 4 A transmission time $T$ is a stopping time for the filtration $\mathcal{F}$.

Definition 5 Given a causal feedback encoder $\Phi$ as in (14) and a transmission time $T$, a
sequential decoder for $\Phi$ and $T$ is a $\mathcal{W}$-valued $\mathcal{F}_T$-measurable random variable.

Notice that with Definitions 4 and 5 we are assuming that perfect causal state knowledge is available
at the receiver. In particular, the fact that the transmitter's and the receiver's information
patterns are nested guarantees that encoder and decoder stay synchronized while using a
variable-length scheme.

Given a causal feedback encoder $\Phi$ as in Definition 3, a transmission time $T$ and a sequential
decoder $\Psi$, we will call the triple $(\Phi, T, \Psi)$ a variable-length block-coding scheme and define
its error probability as

$$p_e(\Phi, T, \Psi) := \mathbb{P}_\Phi(\Psi \neq W).$$

Following Burnashev's approach we shall consider the expected decoding time $\mathbb{E}_\Phi[T]$ as a
measure of the delay of the scheme $(\Phi, T, \Psi)$ and accordingly define its rate as

$$R(\Phi, T, \Psi) := \frac{\log |\mathcal{W}|}{\mathbb{E}_\Phi[T]}.$$

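We are now ready to state our main result. It is formulated in an asymptotic setting,
considering countable families of causal encoders and sequential decoders with asymptotic
average rate below capacity and vanishing error probability.

Theorem 1 For any $R$ in $(0, C)$

1. any family $(\Phi^n, T_n, \Psi^n)_{n \in \mathbb{N}}$ of variable-length block-coding schemes such that

$$\lim_{n \in \mathbb{N}} p_e(\Phi^n, T_n, \Psi^n) = 0, \qquad \liminf_{n \in \mathbb{N}} R(\Phi^n, T_n, \Psi^n) \geq R, \qquad (16)$$

satisfies

$$\limsup_{n \in \mathbb{N}} -\frac{1}{\mathbb{E}_{\Phi^n}[T_n]} \log p_e(\Phi^n, T_n, \Psi^n) \leq E_B(R). \qquad (17)$$

2. there exists a family $(\Phi^n, T_n, \Psi^n)_{n \in \mathbb{N}}$ of variable-length block-coding schemes satisfying
(16) and such that

if $D < +\infty$:
$$\lim_{n \in \mathbb{N}} -\frac{1}{\mathbb{E}_{\Phi^n}[T_n]} \log p_e(\Phi^n, T_n, \Psi^n) = E_B(R), \qquad (18)$$

if $D = +\infty$:
$$p_e(\Phi^n, T_n, \Psi^n) = 0, \qquad \forall\, n \in \mathbb{N}. \qquad (19)$$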



We observe that Burnashev's original result [5] for memoryless channels can be recovered as
a particular case of Theorem 1 when the state space is trivial, i.e. $|\mathcal{S}| = 1$.
3 An upper bound on the achievable error exponent

The aim of this section is to provide an upper bound on the error exponent of an arbitrary
variable-length block-coding scheme. A first observation is that, without any loss of generality,
we can restrict ourselves to the case when the Burnashev coefficient $D$ is finite, since otherwise
the claim (17) is trivially true. The main result of this section is contained in Theorem 7,
whose proof will pass through a series of intermediate steps, contained in Lemmas 2, 3, 4 and 5.

The main idea, taken from Burnashev's original paper (see also [17] and [2]), is to
obtain two different upper bounds for the error probability. Differently from [5] and [17], we
will follow an approach similar to the one proposed in [2] and look at the behaviour of the
a posteriori error probability, instead of that of the a posteriori entropy. The two bounds
correspond to two distinct phases which can be recognized in any sequential transmission
scheme and will be the content of Sections 3.1 and 3.2. The first one is provided in Lemma
3, whose proof is based on an application of Fano's inequality combined with a martingale
argument invoking Doob's optional stopping theorem. The second bound is given by Lemma
5, whose proof combines the use of the log-sum inequality with another application of Doob's
optional stopping theorem. In Section 3.3 these two bounds will be combined, obtaining
Theorem 7, which is a generalization to our setting of Burnashev's result [5].

3.1 A first bound on the error probability 

Suppose we are given a causal feedback encoder $\Phi = (\mathcal{W}, (\phi_t))$ as in (14) and a transmission
time $T$ as in Definition 4. Our goal is to find a lower bound for the error probability $p_e(\Phi, T, \Psi)$,
where $\Psi$ is an arbitrary sequential decoder for $\Phi$ and $T$.

It will be convenient to define for every $t \geq 0$ the $\sigma$-algebra $\mathcal{G}_t := \mathcal{E}_{t+1}$ describing the
encoder's feedback information at time $t+1$. $\mathcal{G} := (\mathcal{G}_t)_{t \in \mathbb{Z}_+}$ will denote the corresponding
filtration. We define the maximum a posteriori error probability conditioned on the $\sigma$-algebras
$\mathcal{F}_t$ and $\mathcal{G}_t$ respectively by

$$p_{MAP}(t) := 1 - \max_{w \in \mathcal{W}} \mathbb{P}_\Phi(W = w | \mathcal{F}_t), \qquad p^*_{MAP}(t) := 1 - \max_{w \in \mathcal{W}} \mathbb{P}_\Phi(W = w | \mathcal{G}_t).$$

Clearly $p_{MAP}(t)$ is an $\mathcal{F}_t$-measurable random variable while $p^*_{MAP}(t)$ is $\mathcal{G}_t$-measurable.

It is a well-known fact that the decoder minimizing the error probability over the class of
fixed-length decoders $\{\Psi : \mathcal{S}^t \times \mathcal{Y}^t \to \mathcal{W}\}$ is the maximum a posteriori one, defined by

$$\Psi_{MAP}(s, y) := \operatorname{argmax}_{w \in \mathcal{W}} \mathbb{P}_\Phi(W = w | S_1^t = s, Y_1^t = y), \qquad s \in \mathcal{S}^t,\ y \in \mathcal{Y}^t$$

(with the convention for the operator argmax to arbitrarily assign one of the optimizing
values in case of non-uniqueness). It will be convenient to consider the larger class of decoders
$\{\Psi : \mathcal{S}^{t+1} \times \mathcal{Y}^t \to \mathcal{W}\}$ (differing from the previous one because of the possible dependence
on the state at time $t+1$); over such a class, the optimal decoder is given by

$$\Psi^*_{MAP}(s, y) := \operatorname{argmax}_{w \in \mathcal{W}} \mathbb{P}_\Phi(W = w | S_1^{t+1} = s, Y_1^t = y), \qquad s \in \mathcal{S}^{t+1},\ y \in \mathcal{Y}^t.$$

It follows that for any $\Psi : \mathcal{S}^t \times \mathcal{Y}^t \to \mathcal{W}$ we have

$$p_e(\Phi, t, \Psi) \geq p_e(\Phi, t, \Psi_{MAP}) \geq p_e(\Phi, t, \Psi^*_{MAP}) = \mathbb{E}_\Phi\big[p^*_{MAP}(t)\big].$$

The discussion above naturally generalizes from the fixed-length setting to the sequential
one. In particular, given a stopping time $T$ for the decoder filtration $\mathcal{F}$, we observe that, since
$\mathcal{F}_t \subseteq \mathcal{G}_t$ for every $t \geq 0$, $T$ is also a stopping time for the filtration $\mathcal{G}$ and $\mathcal{F}_T \subseteq \mathcal{G}_T$. It follows
that the error probability of the scheme $(\Phi, T, \Psi)$, where $\Psi$ is an arbitrary $\mathcal{F}_T$-measurable
$\mathcal{W}$-valued r.v., is lower bounded by that corresponding to the sequential improved MAP decoder
$\Psi^*_{MAP}$, defined by

$$\Psi^*_{MAP} := \operatorname{argmax}_{w \in \mathcal{W}} \mathbb{P}_\Phi(W = w | \mathcal{G}_T).$$

Therefore we can conclude that

$$p_e(\Phi, T, \Psi) \geq \mathbb{E}_\Phi\big[p^*_{MAP}(T)\big], \qquad (20)$$

for any $\mathcal{F}_T$-measurable $\mathcal{W}$-valued random variable $\Psi$.

In the sequel we will lower bound the righthand side of (20). In particular, since the random
variable $W$ is uniformly distributed over the message set $\mathcal{W}$, and since $S_1$ is independent
of $W$, we have that

$$\mathbb{P}_\Phi(W = w | \mathcal{G}_0) = \mathbb{P}_\Phi(W = w) = \frac{1}{|\mathcal{W}|}, \qquad w \in \mathcal{W},$$

so that

$$p^*_{MAP}(0) = \frac{|\mathcal{W}| - 1}{|\mathcal{W}|}.$$

Moreover we have the following recursive lower bound for $p^*_{MAP}(t)$ (see Proposition 2 in [2]
for a similar result in the memoryless case).

Lemma 2 Given any causal feedback encoder $\Phi$, we have, for every $t$ in $\mathbb{N}$,

$$p^*_{MAP}(t) \geq \lambda\, p^*_{MAP}(t-1), \qquad \mathbb{P}_\Phi\text{-a.s.}$$

Proof See Appendix A. □
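For every $\delta$ in $(0, \frac{1}{2})$, we now consider the random variable

$$\tau_\delta := \min\big\{T, \inf\{t \in \mathbb{N} : p^*_{MAP}(t) \leq \delta\}\big\}. \qquad (21)$$

It is immediate to verify that $\tau_\delta$ is a stopping time for the filtration $\mathcal{G}$. Moreover the event
$\{p^*_{MAP}(\tau_\delta) > \delta\}$ implies the event $\{\tau_\delta = T\}$, so that an application of the Markov inequality and
(20) give us

$$\mathbb{P}_\Phi\big(p^*_{MAP}(\tau_\delta) > \delta\big) = \mathbb{P}_\Phi\big(\{p^*_{MAP}(\tau_\delta) > \delta\} \cap \{\tau_\delta = T\}\big)$$
$$\leq \mathbb{P}_\Phi\big(p^*_{MAP}(T) > \delta\big)$$
$$\leq \frac{1}{\delta}\, \mathbb{E}_\Phi\big[p^*_{MAP}(T)\big]$$
$$\leq \frac{1}{\delta}\, p_e(\Phi, T, \Psi).$$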
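The stopping rule (21) is easy to trace on a single sample path. The sketch below assumes a realized trajectory of the a posteriori error probability is given as a list indexed from time 0 (a hypothetical input, not something the paper provides) and returns the first time in $\{1, \ldots, T\}$ at which it drops to $\delta$ or below, or $T$ if that never happens.

```python
def tau_delta(p_map_path, T, delta):
    """Evaluate min{T, inf{t >= 1 : p_map_path[t] <= delta}} on one sample path.

    p_map_path[t] is the a posteriori error probability at time t, for
    t = 0, ..., T (hypothetical sample-path data).
    """
    for t in range(1, T + 1):
        if p_map_path[t] <= delta:
            return t
    return T

if __name__ == "__main__":
    # The posterior error drops below delta = 0.1 at time 2, before T = 3.
    print(tau_delta([0.75, 0.5, 0.05, 0.01], 3, 0.1))
    # The threshold is never reached, so the rule stops at T.
    print(tau_delta([0.75, 0.5, 0.4, 0.3], 3, 0.1))
```

On any path where the value at the stopped time still exceeds $\delta$, the returned time equals $T$, which is the inclusion of events used in the first step of the chain of inequalities above.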

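We introduce the following notation for the a posteriori entropy:

$$\Gamma_t := -\sum_{w \in \mathcal{W}} \mathbb{P}_\Phi(W = w | \mathcal{G}_t) \log \mathbb{P}_\Phi(W = w | \mathcal{G}_t).$$

Observe that, since $S_1$ is independent of the message $W$, then

$$\Gamma_0 = \log |\mathcal{W}|, \qquad \mathbb{P}_\Phi\text{-a.s.}$$

It is easy to verify that, whenever $p^*_{MAP}(\tau_\delta) \leq \delta$, we have

$$\Gamma_{\tau_\delta} \leq H(\delta) + \delta \log |\mathcal{W}|.$$

Hence the expected value of $\Gamma_{\tau_\delta}$ can be bounded from above as follows:

$$\mathbb{E}_\Phi\big[\Gamma_{\tau_\delta}\big] = \mathbb{E}_\Phi\big[\Gamma_{\tau_\delta} \,\big|\, p^*_{MAP}(\tau_\delta) \leq \delta\big]\, \mathbb{P}_\Phi\big(p^*_{MAP}(\tau_\delta) \leq \delta\big) + \mathbb{E}_\Phi\big[\Gamma_{\tau_\delta} \,\big|\, p^*_{MAP}(\tau_\delta) > \delta\big]\, \mathbb{P}_\Phi\big(p^*_{MAP}(\tau_\delta) > \delta\big)$$
$$\leq \big(H(\delta) + \delta \log |\mathcal{W}|\big)\, \mathbb{P}_\Phi\big(p^*_{MAP}(\tau_\delta) \leq \delta\big) + \mathbb{P}_\Phi\big(p^*_{MAP}(\tau_\delta) > \delta\big) \log |\mathcal{W}|$$
$$\leq H(\delta) + \Big(\delta + \frac{1}{\delta}\, p_e(\Phi, T, \Psi)\Big) \log |\mathcal{W}|. \qquad (22)$$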
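The Fano-type step used above (a posterior whose MAP error is at most $\delta < 1/2$ has entropy at most $H(\delta) + \delta \log |\mathcal{W}|$, with $H$ the binary entropy function) can be checked numerically. The sketch below is only a sanity check on hypothetical posterior vectors, using natural logarithms.

```python
import math

def binary_entropy(d):
    """Binary entropy H(d) in nats, with H(0) = H(1) = 0."""
    if d in (0.0, 1.0):
        return 0.0
    return -d * math.log(d) - (1 - d) * math.log(1 - d)

def posterior_entropy(posterior):
    """Entropy (nats) of a posterior distribution over the message set."""
    return -sum(q * math.log(q) for q in posterior if q > 0)

def fano_type_bound(delta, n_messages):
    """Upper bound H(delta) + delta * log|W| on the a posteriori entropy."""
    return binary_entropy(delta) + delta * math.log(n_messages)

if __name__ == "__main__":
    posterior = [0.95, 0.03, 0.02]   # MAP error 0.05, below delta = 0.1
    print(posterior_entropy(posterior), fano_type_bound(0.1, len(posterior)))
```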



We now introduce, for every time $t$ in $\mathbb{N}$, a $\mathcal{P}(\mathcal{X})$-valued random variable $\Upsilon_{\Phi,t}$ describing
the channel input distribution induced by the encoder $\Phi$ at time $t$:

$$\Upsilon_{\Phi,t}(x) := \mathbb{P}_\Phi(X_t = x | \mathcal{E}_t) = \mathbb{P}_\Phi\big(\phi_t(W, S_1^t, Y_1^{t-1}) = x \,\big|\, S_1^t, Y_1^{t-1}\big), \qquad x \in \mathcal{X}. \qquad (23)$$

Notice that $\Upsilon_{\Phi,t}$ is $\mathcal{E}_t$-measurable, i.e. equivalently it is a function of the pair $(S_1^t, Y_1^{t-1})$.
The subscript in $\Upsilon_{\Phi,t}$ emphasizes the fact that this quantity depends on the encoder, with
no restriction on it other than causality.

The following result relates three relevant quantities characterizing the performance of
any causal encoder and sequential decoder pair: the cardinality of the message set $\mathcal{W}$, the error
probability of the encoder-decoder pair, and the mutual information cost $c$ (5) up to the
stopping time $\tau_\delta$:

$$C_\delta(\Phi, T) := \mathbb{E}_\Phi\Big[\sum_{t=1}^{\tau_\delta} c(S_t, \Upsilon_{\Phi,t})\Big]. \qquad (24)$$

Lemma 3 For any variable-length block-coding scheme $(\Phi, T, \Psi)$ and any $0 < \delta < \frac{1}{2}$ we
have

$$C_\delta(\Phi, T) \geq \Big(1 - \delta - \frac{p_e(\Phi, T, \Psi)}{\delta}\Big) \log |\mathcal{W}| - H(\delta). \qquad (25)$$

Proof See Appendix A. □



3.2 A lower bound to the error probability of a composite binary hypothesis test

We now consider a particular binary hypothesis testing problem which will arise while proving
the main result. Suppose we are given a causal feedback encoder $\Phi = (\mathcal{W}, (\phi_t))$. Consider a
nontrivial binary partition of the message set

$$\mathcal{W} = \mathcal{W}_0 \cup \mathcal{W}_1 \qquad (26)$$

and a sequential binary hypothesis test $\tilde{\Psi} = (T, \tilde{\psi})$ (where $T$ is a stopping time for the
filtration $\mathcal{G}$, and $\tilde{\psi}$ is a $\mathcal{G}_T$-measurable $\{0,1\}$-valued random variable) between the hypothesis
$\{W \in \mathcal{W}_0\}$ and the hypothesis $\{W \in \mathcal{W}_1\}$. Following the standard statistical terminology we
call $\tilde{\Psi}$ a composite test since it must decide between two classes of probability laws for the
process $(S, Y)$ rather than between two single laws. For every $t$, we define the $\mathcal{P}(\mathcal{X})$-valued
random variables $\Upsilon^0_{\Phi,t}$ and $\Upsilon^1_{\Phi,t}$ by

$$\Upsilon^i_{\Phi,t}(x) := \mathbb{P}_\Phi(X_t = x | W \in \mathcal{W}_i, \mathcal{E}_t), \qquad x \in \mathcal{X},\ i = 0, 1.$$

The r.v. $\Upsilon^0_{\Phi,t}$ (respectively $\Upsilon^1_{\Phi,t}$) represents the channel input distribution at time $t$ induced
by the encoder $\Phi$ when restricted to the message subset $\mathcal{W}_0$ (resp. $\mathcal{W}_1$). Notice that

$$\Upsilon_{\Phi,t} = \mathbb{P}_\Phi(W \in \mathcal{W}_0 | \mathcal{E}_t)\, \Upsilon^0_{\Phi,t} + \mathbb{P}_\Phi(W \in \mathcal{W}_1 | \mathcal{E}_t)\, \Upsilon^1_{\Phi,t}.$$



Let $\tau$ be another stopping time for the filtration $\mathcal{G}$, such that $\tau \leq T$. Let us consider the
conditional expectation terms

$$L_i := \mathbb{E}_\Phi\left[ \log \frac{\mathbb{P}_\Phi\big(S_{\tau+2}^{T+1}, Y_{\tau+1}^{T} \,\big|\, W \in \mathcal{W}_i, \mathcal{G}_\tau\big)}{\mathbb{P}_\Phi\big(S_{\tau+2}^{T+1}, Y_{\tau+1}^{T} \,\big|\, W \in \mathcal{W}_{1-i}, \mathcal{G}_\tau\big)} \,\middle|\, W \in \mathcal{W}_i, \mathcal{G}_\tau \right], \qquad i = 0, 1.$$

Both $L_0$ and $L_1$ are $\mathcal{G}_\tau$-measurable random variables. In particular $L_0$ equals the
Kullback-Leibler information divergence between the $\mathcal{G}_\tau$-conditioned probability distributions of the
pair $(S_{\tau+2}^{T+1}, Y_{\tau+1}^{T})$ respectively given $\{W \in \mathcal{W}_0\}$ and $\{W \in \mathcal{W}_1\}$; an analogous interpretation
is possible, mutatis mutandis, for $L_1$.

In the special case when both $\tau$ and $T$ are deterministic constants, an application of
the log-sum inequality would show that $L_0$ can be upper bounded by the $\mathcal{G}_\tau$-conditional
expected value of the sum of the information divergence costs $d(S_t, \Upsilon^0_{\Phi,t})$ from
time $\tau + 1$ to $T$, and analogously for $L_1$, with the terms $d(S_t, \Upsilon^1_{\Phi,t})$. It turns out that the
same is true in our setting where both $\tau$ and $T$ are stopping times for the filtration $\mathcal{G}$, as
stated in the following lemma, whose proof requires, besides an application of the log-sum
inequality, a martingale argument invoking Doob's optional stopping theorem.

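Lemma 4 Let $\tau$ and $T$ be stopping times for the filtration $\mathcal G$ such that $\tau \le T$, and consider
a partition of the message set as in (26). Then

$$L_i \le \mathbb E_\Phi\left[\sum_{t=\tau+1}^{T} d\left(S_t, \Upsilon^{\Phi,i}_t\right) \,\Big|\, W \in \mathcal W_i, \mathcal G_\tau\right] \quad \text{a.s.}, \qquad i = 0, 1 . \tag{27}$$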



Proof See Appendix A. □

Suppose now that $\mathcal W_1$ is a $\mathcal G_\tau$-measurable random variable taking values in $2^{\mathcal W} \setminus \{\emptyset, \mathcal W\}$, the
class of nontrivial proper subsets of the message set $\mathcal W$. In other words, we are assuming that
$\mathcal W_1$ is a random subset of the message set $\mathcal W$, deterministic function of the pair $(S_1^{\tau+1}, Y_1^{\tau})$.
The following result gives a lower bound on the error probability of the binary test $\tilde\Psi$
conditioned on the $\sigma$-algebra $\mathcal G_\tau$ in terms of the information divergence terms



Ed 



T 



t=r+l 



0,1. 



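Lemma 5 Let $\Phi$ be any causal encoder, and $\tau$ and $T$ be stopping times for the filtration $\mathcal G$
such that $\tau \le T$. Then, for every $2^{\mathcal W} \setminus \{\emptyset, \mathcal W\}$-valued $\mathcal G_\tau$-measurable r.v. $\mathcal W_1$, we have that
$\mathbb P_\Phi$-a.s.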



Ed 



.t=T + l 



, z 

>log- 



logP ^T^ l{H'gWi}|6^r 



(28) 



where

$$Z := \min\left\{\mathbb P_\Phi(W \in \mathcal W_0 \mid \mathcal G_\tau),\ \mathbb P_\Phi(W \in \mathcal W_1 \mid \mathcal G_\tau)\right\} .$$

Proof See the Appendix. □

3.3 Burnashev bound for Markov channels



Lemma 6 Let $\Phi$ be a causal feedback encoder and $T$ a transmission time for $\Phi$. Then, for
every $0 < \delta < 1/2$ there exists a $\mathcal G_{\tau_\delta}$-measurable random subset $\mathcal W_1$ of the message set $\mathcal W$,
whose a posteriori probability satisfies

$$1 - \lambda\delta \ge \mathbb P_\Phi(W \in \mathcal W_1 \mid \mathcal G_{\tau_\delta}) \ge \lambda\delta . \tag{29}$$

Proof See the Appendix. □

To a causal encoder $\Phi$ and a transmission time $T$, for every $0 < \delta < 1/2$ we define the
quantity

T 



Ds{'^,T):= max E 

Wi Grg —measurable 

2^ —valued r.v. 
A<5<P${H^eWi|er^)<l-A5 



t=TS + l 



(30) 



The quantity $D_\delta(\Phi, T)$ equals the maximum, over all possible choices of a nontrivial partition
of the message set $\mathcal W$ as a function of the joint channel state output process $(S_1^{\tau_\delta+1}, Y_1^{\tau_\delta})$
stopped at the intermediate time $\tau_\delta$, of the averaged sum of the information divergence costs
$d$ incurred between times $\tau_\delta + 1$ and $T$. Intuitively $D_\delta(\Phi, T)$ measures the
maximum error exponent achievable by the encoder $\Phi$ when transmitting a binary message
between times $\tau_\delta$ and $T$.

Based on Lemma 3 and Lemma 5 we will now prove the main result of this section,
consisting in an upper bound on the largest error exponent achievable by variable-length
block-coding schemes with perfect causal state knowledge and output feedback.

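Theorem 7 Consider a variable-length block-coding scheme $(\Phi, T, \Psi)$. Then, for every $\delta \in (0, 1/2)$,

$$-\log p_e(\Phi, T, \Psi) \le \frac{D}{C}\, C_\delta(\Phi, T) + D_\delta(\Phi, T) - \frac{D}{C} \log|\mathcal W|\, (1 - \alpha) - \beta , \tag{31}$$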



where 



a := 6 + 



C 

Pe{^,T,^) 



(5 := H 



C 



X5 D 



C 



H(5). 



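Proof Let $\mathcal W_1$ be a $\mathcal G_{\tau_\delta}$-measurable subset of the message space $\mathcal W$ satisfying (29). We define
the binary sequential decoder

$$\tilde\Psi_\delta := 1_{\mathcal W_1}(\Psi) .$$

Notice that the definition above is consistent in the sense that $\tilde\Psi_\delta$ is $\mathcal G_T$-measurable, since $\Psi$
is $\mathcal G_T$-measurable, while $\mathcal W_1$ is $\mathcal G_{\tau_\delta}$-measurable and $\mathcal G_{\tau_\delta} \subseteq \mathcal G_T$.

We can lower bound the error probability of the composite hypothesis test conditioned
on $\mathcal G_{\tau_\delta}$ using Lemma 5 and (29), obtaining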



logP$ §5/lwi(VF)|e?,J+logM < 



t=TS+l ^ 

It is clear that the error event of the decoder $\Psi$ is implied by the error event of $\tilde\Psi_\delta$. It follows
that



logpe ($,T,^) +log — 



< 



■iogE$ [p {^^, ^w\gr,)]+\og — 

■ log E., [P (Iwi (*) / Iwi (VF) I Grs)] + log Y 



— log E$ 



< E 



logP$ §5 / IWiWla 



T5 



+ log ■ ^ 
+ log 



(32) 



< E, 



Efl 



t=T5 + l 



{wewi} 



the second inequality in (32) following from Chebyshev's inequality. The claim now follows by
taking a linear combination of (32) and (25). ■

In the memoryless case, i.e. when the state space is trivial ($|\mathcal S| = 1$), Burnashev's original
result (see (4.1) in [5], see also (12) in [2]) can be recovered from (31) by optimizing over the
channel input distributions $\Upsilon^{\Phi}_t$, $\Upsilon^{\Phi,0}_t$, and $\Upsilon^{\Phi,1}_t$.

In order to prove Part 1 of Theorem 1 it remains to consider countable families of variable-
length coding schemes with vanishing error probability and to show that asymptotically the
upper bound in (31) reduces to the Burnashev exponent $E_B(R)$. This involves new technical
challenges which will be the object of next section.



4 Markov decision problems with stopping time horizons 

In this section we shall recall some concepts about Markov decision processes which will allow
us to asymptotically estimate the terms $C_\delta(\Phi, T)$ and $D_\delta(\Phi, T)$ respectively in terms of the
capacity $C$ defined in (6) and the Burnashev coefficient $D$ (12) of the FSMC.

The main idea is to interpret the maximization of $C_\delta(\Phi, T)$ and $D_\delta(\Phi, T)$ as stochastic
control problems with average cost criterion [1]. The control is the channel input distribution
chosen as a function of the available feedback information and the controller is identified
with the encoder. The main novelty these problems have with respect to those traditionally
addressed by Markov decision theory consists in the fact that, as a consequence of considering
variable-length coding schemes, we shall need to deal with the situation when the horizon is
neither finite (in the sense of being a deterministic constant) nor infinite (in the sense of being
concerned with the asymptotic normalized average running cost), but rather it is allowed to
be a random stopping time. In order to handle this case we adopt the convex analytical
approach, a technique first introduced by Manne in [18] (see also [9]) for the finite state finite
action setting, and later developed in great generality by Borkar [4].

In Section 4.1 we shall first reformulate the problem of optimizing the terms $C_\delta(\Phi, T)$
and $D_\delta(\Phi, T)$ with respect to the causal encoder $\Phi$. Then, we present a brief review of the
convex analytical approach to Markov decision problems in Section 4.2, presenting the main
ideas and definitions. In Section 4.3 we will prove a uniform convergence theorem for the
empirical measure process and use this result to treat the asymptotic case of the average cost
problem with stopping time horizon. The main result of this section is contained in Theorem
11, which is then applied in Section 4.4 together with Theorem 7 in order to prove Part 1 of
Theorem 1.

4.1 Markov decision problems with stopping time horizons



We shall consider a controlled Markov chain over $\mathcal S$, with compact control space $\mathcal U := \mathcal P(\mathcal X)$,
the space of channel input distributions. Let $g : \mathcal S \times \mathcal U \to \mathbb R$ be a continuous (and thus
bounded) cost function; in our application $g$ will coincide either with the mutual information
cost $c$ defined in (5) or with the information divergence cost $d$. We prefer to
consider the general case in order to deal with both problems at once.

The evolution of the system is described by a state sequence $S = (S_t)$, an output sequence
$Y = (Y_t)$ and a control sequence $U = (U_t)$. If at time $t$ the system is in state $S_t = s$ in $\mathcal S$,
and a control $U_t = u$ in $\mathcal U$ is chosen according to some policy, then a cost $g(s, u)$ is incurred
and the system produces the output $Y_t = y$ in $\mathcal Y$ and moves to next state $S_{t+1} = s_+$ in $\mathcal S$
according to the stochastic kernel $Q(s_+, y \mid s, u)$, defined in (3). Once the transition into next
state has occurred, a new action is chosen and the process is repeated.

At time $t$, the control $U_t$ is allowed to be an $\mathcal E_t$-measurable random variable, where
$\mathcal E_t = \sigma(S_1^t, Y_1^{t-1})$ is the encoder's feedback information pattern at time $t$; in other words we
are assuming that $U_t = \pi_t(S_1^t, Y_1^{t-1})$ for some map

$$\pi_t : \mathcal S^t \times \mathcal Y^{t-1} \to \mathcal U .$$

We define a feasible policy $\pi$ as a sequence $(\pi_t)_{t \in \mathbb N}$ of such maps. Once a feasible policy $\pi$
has been chosen, a joint probability distribution $\mathbb P_\pi$ for state, control and output sequences
is well defined; we will denote by $\mathbb E_\pi$ the corresponding expectation operator.

Let $\tau$ be a stopping time for the filtration $\mathcal G = (\mathcal G_t)$ (recall that $\mathcal G_t = \mathcal E_{t+1}$ describes
the encoder's feedback and state information at time $t + 1$), and consider the following
optimization problem: maximize

$$\frac{1}{\mathbb E_\pi[\tau]}\, \mathbb E_\pi\left[\sum_{t=1}^{\tau} g\left(S_t, \pi_t(S_1^t, Y_1^{t-1})\right)\right] \tag{33}$$

over all feasible policies $\pi = (\pi_t)$ such that $\mathbb E_\pi[\tau]$ is finite.

Clearly, in the special case when $\tau$ is a constant, (33) reduces to the standard finite-horizon
problem which is usually solved with dynamic programming tools. Another special case is
when $\tau$ is geometrically distributed and independent from the processes $S$, $U$ and $Y$. In this
case (33) reduces to the so-called discounted problem which has been widely studied in the
stochastic control literature [1]. However, what makes the problem nonstandard is that in
(33) $\tau$ is allowed to be an arbitrary stopping time for the filtration $\mathcal G$, generally correlated
with the processes $S$, $U$ and $Y$.
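For the constant-horizon special case just mentioned, the optimal value can be computed by backward induction. The following is a minimal Python sketch under simplifying assumptions that are not part of the paper: the control set is a finite list of actions rather than the whole simplex $\mathcal P(\mathcal X)$, the policy observes the current state only, and the names `finite_horizon_value`, `trans` and `cost`, as well as the numbers, are hypothetical.

```python
def finite_horizon_value(trans, cost, horizon):
    """Maximal expected total cost over `horizon` steps, by backward induction.

    trans[u][s][s2] : probability of moving from state s to s2 under action u
    cost[s][u]      : cost g(s, u) collected in state s under action u
    Returns a list V with V[s] = optimal value when starting in state s.
    """
    n_states = len(cost)
    n_actions = len(cost[0])
    value = [0.0] * n_states  # value with zero steps to go
    for _ in range(horizon):
        # Bellman recursion: best immediate cost plus expected continuation value.
        value = [
            max(
                cost[s][u]
                + sum(trans[u][s][s2] * value[s2] for s2 in range(n_states))
                for u in range(n_actions)
            )
            for s in range(n_states)
        ]
    return value


# Toy two-state, two-action instance (illustrative numbers only).
trans = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
cost = [[1.0, 0.5], [0.0, 0.2]]
one_step = finite_horizon_value(trans, cost, 1)
ten_step = finite_horizon_value(trans, cost, 10)
```

Dividing `ten_step[s]` by the horizon gives the normalized quantity appearing in (33) for a deterministic $\tau$; this recursion does not cover a general stopping time.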



4.2 The convex analytical approach 

We review some of the ideas of the convex-analytical approach following [4]. 

A feasible policy $\pi$ is said to be stationary if the current control depends on the current
state only and is independent of the past state and output history and of the time, i.e. there
exists a map $\pi : \mathcal S \to \mathcal U$ such that $\pi_t(s_1^t, y_1^{t-1}) = \pi(s_t)$ for all $t$. We will identify a stationary
policy as above with the map $\pi : \mathcal S \to \mathcal U$. It has already been noted in Section 2.1 that, for
every stationary policy $\pi$, the associated stochastic matrix $Q_\pi$ is irreducible, so that
existence and uniqueness of an invariant measure $\mu_\pi$ in $\mathcal P(\mathcal S)$ are guaranteed. It follows that,
if a stationary policy $\pi$ is used, then the normalized running cost $\frac{1}{n} \sum_{t=1}^{n} g(S_t, \pi(S_t))$ converges
$\mathbb P_\pi$-almost surely to $\sum_{s \in \mathcal S} \mu_\pi(s)\, g(s, \pi(s))$. Define

$$G := \max_{\pi : \mathcal S \to \mathcal U} \sum_{s \in \mathcal S} \mu_\pi(s)\, g(s, \pi(s)) . \tag{34}$$

Figure 2: A schematic representation of the optimization problem (40). The large triangular
space is the infinite dimensional Prohorov space $\mathcal P(\mathcal S \times \mathcal U)$. Its gray-shaped subset represents
the closed convex set $K$ of all occupation measures. The set of extreme points of $K$ is $K_e$
and corresponds to the set of all occupation measures associated to stationary deterministic
policies. The optimal value of the linear functional $\eta \mapsto \langle \eta, g \rangle$ happens to be achieved on $K_e$
and thus corresponds to the occupation measure $\eta^*$ associated to an optimal deterministic
stationary policy $\pi^* : \mathcal S \to \mathcal P(\mathcal X)$.

Observe that the optimization in the righthand side of (34) has the same form of those in the
definitions (6) and (12) of the capacity and the Burnashev coefficient of an ergodic FSMC
given in Section 2. Notice that compactness of the space $\mathcal U^{\mathcal S}$ of all stationary policies and
continuity of the cost $g(s, \pi(s))$ and of the invariant measure $\mu_\pi$ as functions of the stationary
policy $\pi$ guarantee the existence of an optimal value in the above maximization.
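To make the maximization in (34) concrete, the sketch below evaluates $\sum_s \mu_\pi(s) g(s, \pi(s))$ for every stationary deterministic policy of a two-state chain. It is only an illustration under stated assumptions: the control set is replaced by a finite list of actions (so the result is the maximum over that restricted set, not over the whole of $\mathcal U$), and `invariant_two_state`, `best_stationary_value` and the numbers are hypothetical.

```python
from itertools import product


def invariant_two_state(a, b):
    """Invariant measure of a two-state chain with P(0->1) = a, P(1->0) = b, a + b > 0."""
    return (b / (a + b), a / (a + b))


def best_stationary_value(trans, cost):
    """Maximize sum_s mu_pi(s) g(s, pi(s)) over maps pi: {0, 1} -> actions.

    trans[u][s][s2] : transition probability under action u
    cost[s][u]      : cost g(s, u)
    Returns (best value, best policy as a tuple (pi(0), pi(1))).
    """
    n_actions = len(trans)
    best = None
    for pi in product(range(n_actions), repeat=2):
        a = trans[pi[0]][0][1]  # leave state 0 under the action chosen there
        b = trans[pi[1]][1][0]  # leave state 1 under the action chosen there
        mu = invariant_two_state(a, b)
        value = mu[0] * cost[0][pi[0]] + mu[1] * cost[1][pi[1]]
        if best is None or value > best[0]:
            best = (value, pi)
    return best


trans = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
cost = [[1.0, 0.5], [0.0, 0.2]]
G_restricted, pi_star = best_stationary_value(trans, cost)
```

Enumeration is feasible here only because both the state space and the assumed action list are tiny.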

We now consider stationary randomized policies. These are defined as maps $\pi : \mathcal S \to \mathcal P(\mathcal U)$,
where $\mathcal P(\mathcal U)$ denotes the space of probability measures on $\mathcal U$, equipped with its Prohorov topol-
ogy [3]. To any stationary randomized policy $\pi$ the following control strategy is associated:
if at time $t$ the state is $S_t = s$, then the control $U_t$ is randomly chosen in the control space
$\mathcal U$ with conditional distribution given the available information $\mathcal E_t = \sigma(S_1^t, Y_1^{t-1})$ equal to
$\pi(s)$. Observe that there are two levels of randomization. The control space itself has al-
ready been defined as the space of channel input probability distributions $\mathcal P(\mathcal X)$, while the
strategy associated to the stationary randomized policy $\pi$ chooses a control at random with
conditional distribution $\pi(S_t)$ in $\mathcal P(\mathcal U) = \mathcal P(\mathcal P(\mathcal X))$. Of course randomized stationary policies
are a generalization of deterministic stationary policies, since to any deterministic stationary
policy $\pi : \mathcal S \to \mathcal U$ it is possible to associate the randomized policy $s \mapsto \delta_{\pi(s)}$, the Dirac measure concentrated at $\pi(s)$. To any
randomized stationary policy $\pi : \mathcal S \to \mathcal P(\mathcal U)$ we associate the stochastic matrix describing the
associated state transition probabilities

$$Q_\pi = \left(Q_\pi(s_+ \mid s)\right)_{s, s_+ \in \mathcal S} , \qquad Q_\pi(s_+ \mid s) := \int_{\mathcal U} Q(s_+ \mid s, u)\, [\pi(s)](du) . \tag{35}$$

Similarly to the case of stationary deterministic policies, it is not difficult to conclude that,
since $Q_\pi$ can be written as a convex combination of a finite number of stochastic matrices
$Q_f$, with $f : \mathcal S \to \mathcal X$, all of which are irreducible, then $Q_\pi$ itself is irreducible and thus admits
a unique state ergodic measure $\mu_\pi$ in $\mathcal P(\mathcal S)$. This motivates the following definition.

Definition 6 For every stationary (randomized) policy $\pi : \mathcal S \to \mathcal P(\mathcal U)$ the occupation mea-
sure of $\pi$ is $\eta_\pi$ in $\mathcal P(\mathcal S \times \mathcal U)$ defined by

$$\langle \eta_\pi, h \rangle = \int_{\mathcal S \times \mathcal U} h(s, u)\, d\eta_\pi(s, u) = \sum_{s \in \mathcal S} \mu_\pi(s) \int_{\mathcal U} h(s, u)\, d[\pi(s)](u) , \qquad \forall h \in C_b(\mathcal S \times \mathcal U) ,$$

where $\mu_\pi$ in $\mathcal P(\mathcal S)$ is the invariant measure of the stochastic matrix $Q_\pi$, while $C_b(\mathcal S \times \mathcal U)$ is
the space of bounded continuous maps from $\mathcal S \times \mathcal U$ to $\mathbb R$.

The occupation measure $\eta_\pi$ can be viewed as the long-time empirical frequency of the joint
state-control process governed by the stationary (randomized) policy $\pi$. In fact, for every
time $n$ in $\mathbb N$, we can associate to the controlled Markov process the empirical measure $\nu_n$
which is a $\mathcal P(\mathcal S \times \mathcal U)$-valued random variable sample-path-wise defined by

$$\langle \nu_n, h \rangle := \frac{1}{n} \sum_{t=1}^{n} h(S_t, U_t) , \qquad \forall h \in C_b(\mathcal S \times \mathcal U) . \tag{36}$$

Observe that $\nu_n$ is a probability measure on the product space $\mathcal S \times \mathcal U$, and is itself a random
variable since it is defined as a function of the joint state control random process $(S_1^n, U_1^n)$.
Then, it can be verified that, if the process is controlled by a stationary (randomized) policy
$\pi$, then

$$\lim_{n \in \mathbb N} \nu_n = \eta_\pi \qquad \mathbb P_\pi\text{-a.s.} \tag{37}$$
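The definition (36) can be illustrated on a short sample path. The sketch below builds the empirical measure of the observed state-control pairs and checks that pairing it with a cost function reproduces the running average. It assumes controls drawn from a finite set of labels (the paper's control space is a simplex); the helper names `empirical_measure` and `pairing`, the sample path and the cost `g` are hypothetical.

```python
from collections import Counter


def empirical_measure(states, controls):
    """Empirical measure of the pairs (S_t, U_t), t = 1..n, as a dict pair -> weight."""
    n = len(states)
    counts = Counter(zip(states, controls))
    return {pair: c / n for pair, c in counts.items()}


def pairing(measure, h):
    """<measure, h> = sum over pairs (s, u) of measure(s, u) * h(s, u)."""
    return sum(w * h(s, u) for (s, u), w in measure.items())


def g(s, u):
    """Hypothetical bounded cost, used only for illustration."""
    return (2.0 if s == "G" else 0.5) + u


states = ["G", "G", "B", "G"]
controls = [0, 1, 1, 0]
nu_4 = empirical_measure(states, controls)
running_average = sum(g(s, u) for s, u in zip(states, controls)) / len(states)
```

Here `pairing(nu_4, g)` and `running_average` coincide, which is the identity between the normalized running cost and $\langle \nu_n, g \rangle$ used later in Section 4.3.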

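We will denote by $K$ the set of the occupation measures associated to all the stationary
randomized policies, i.e.

$$K := \{\eta_\pi \mid \pi : \mathcal S \to \mathcal P(\mathcal U)\} \subseteq \mathcal P(\mathcal S \times \mathcal U) , \tag{38}$$

and by $K_e$ the set of all occupation measures associated to stationary deterministic policies

$$K_e := \{\eta_\pi \mid \pi : \mathcal S \to \mathcal U\} \subseteq \mathcal P(\mathcal S \times \mathcal U) .$$

Well known results (see [4]) show that both $K$ and $K_e$ are closed subsets of $\mathcal P(\mathcal S \times \mathcal U)$.
Moreover $K$ is convex and $K_e$ coincides with the set of extreme points of $K$. Furthermore it
is possible to characterize $K$ as the set of zeros of the continuous linear functional

$$F : \mathcal P(\mathcal S \times \mathcal U) \to [-1, 1]^{|\mathcal S|} , \qquad F_s(\eta) := \eta(\{s\} \times \mathcal U) - \int_{\mathcal S \times \mathcal U} Q(s \mid j, u)\, d\eta(j, u) ,$$

i.e.

$$K = \{\eta \in \mathcal P(\mathcal S \times \mathcal U) : F(\eta) = 0\} . \tag{39}$$

In fact it is possible to think of $\|F(\eta)\|$ (here and throughout the paper $\|x\| := \max_i |x_i|$
will denote the $L^\infty$-norm of a vector $x$) as a measure of how far the $\mathcal S$-marginal of a measure
$\eta$ in $\mathcal P(\mathcal S \times \mathcal U)$ is from being invariant for the state process.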
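The characterization (39) can be checked numerically on a toy instance. The sketch below computes $F(\eta)$ for the occupation measure of a stationary deterministic policy, where it vanishes, and for a measure whose state marginal is not invariant, where it does not. The finite action list, the helper names `occupation_measure`, `F` and `sup_norm`, and the numbers are assumptions made for illustration.

```python
def occupation_measure(trans, pi):
    """Occupation measure of a deterministic stationary policy pi on states {0, 1}."""
    a = trans[pi[0]][0][1]  # probability of moving 0 -> 1 under pi
    b = trans[pi[1]][1][0]  # probability of moving 1 -> 0 under pi
    mu = (b / (a + b), a / (a + b))
    return {(s, pi[s]): mu[s] for s in (0, 1)}


def F(trans, eta):
    """F_s(eta) = eta({s} x U) - sum_{j,u} Q(s | j, u) eta(j, u), for s = 0, 1."""
    out = []
    for s in (0, 1):
        mass = sum(w for (j, u), w in eta.items() if j == s)
        flow = sum(trans[u][j][s] * w for (j, u), w in eta.items())
        out.append(mass - flow)
    return out


def sup_norm(x):
    """Maximum absolute coordinate, matching the norm used in the text."""
    return max(abs(xi) for xi in x)


trans = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
eta_pi = occupation_measure(trans, (0, 1))
not_invariant = {(0, 0): 0.5, (1, 1): 0.5}
```

With these numbers `sup_norm(F(trans, eta_pi))` is zero up to rounding, while `sup_norm(F(trans, not_invariant))` equals 0.25.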

If one were interested in optimizing the infinite-horizon running average cost

$$\liminf_{n \in \mathbb N} \frac{1}{n}\, \mathbb E_\pi\left[\sum_{t=1}^{n} g(S_t, U_t)\right] = \liminf_{n \in \mathbb N} \mathbb E_\pi\left[\langle \nu_n, g \rangle\right]$$

over all (randomized) stationary policies $\pi$, then (37) and (38) would immediately lead to the
following convex optimization problem:

$$\max_{\eta \in K} \langle \eta, g \rangle . \tag{40}$$

In fact, using (39), (40) can be rewritten as an infinite dimensional linear programming
problem

$$\max_{\eta \in \mathcal P(\mathcal S \times \mathcal U) :\ F(\eta) = 0} \langle \eta, g \rangle . \tag{41}$$

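We notice that, since $\mathcal U$ is compact and $\mathcal S$ is finite, $\mathcal P(\mathcal S \times \mathcal U)$ is compact. Thus both $K$ and
$K_e$ are compact. It follows that, since the map

$$\mathcal P(\mathcal S \times \mathcal U) \ni \eta \mapsto \langle \eta, g \rangle \in \mathbb R$$

is continuous, it achieves its maxima both on $K$ and $K_e$; moreover, the same map is linear
so that these maxima do coincide, i.e. the maximum over $K$ is achieved in an extreme point.
Thus we have the following chain of equalities

$$G = \max_{\pi : \mathcal S \to \mathcal U} \sum_{s \in \mathcal S} \mu_\pi(s)\, g(s, \pi(s)) = \max_{\pi : \mathcal S \to \mathcal U} \langle \eta_\pi, g \rangle = \max_{\eta \in K_e} \langle \eta, g \rangle = \max_{\eta \in K} \langle \eta, g \rangle = \max_{\eta :\ F(\eta) = 0} \langle \eta, g \rangle . \tag{42}$$

We observe that in the last term in (42) both the constraints and the objective functional are
linear. This indicates (infinite dimensional) linear programming as a possible approach for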
computing G, alternative to the dynamic programming ones based on policy or value iteration 
techniques [I], [1]. Moreover, it shows an easy way to generalize the theory taking into account 
average cost constraints (see where the Burnashev exponent of DMCs with average cost 
constraints is studied). In fact, in the convex analytical approach these constraints merely 
translate into additional constraints for the linear program. 
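The statement that the maximum over $K$ is attained at an extreme point can be illustrated by brute force on a toy instance. This is not a linear programming solver: the sketch simply scans a grid of randomized stationary policies and compares the best value with the best deterministic one. It assumes two states and a two-element action list; `TRANS`, `COST` and `average_cost` are hypothetical names and numbers.

```python
from itertools import product

TRANS = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0: rows are current states
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
COST = [[1.0, 0.5], [0.0, 0.2]]  # COST[s][u] = g(s, u)


def average_cost(q0, q1):
    """<eta_pi, g> for the policy playing action 1 with probability q_s in state s."""
    q = (q0, q1)
    # Probability of leaving each state under the mixed policy, a mixture as in (35).
    leave = [
        (1 - q[s]) * TRANS[0][s][1 - s] + q[s] * TRANS[1][s][1 - s]
        for s in (0, 1)
    ]
    total = leave[0] + leave[1]
    mu = (leave[1] / total, leave[0] / total)  # invariant measure of the mixed chain
    return sum(
        mu[s] * ((1 - q[s]) * COST[s][0] + q[s] * COST[s][1]) for s in (0, 1)
    )


grid = [i / 20 for i in range(21)]
best_randomized = max(average_cost(q0, q1) for q0, q1 in product(grid, grid))
best_deterministic = max(
    average_cost(q0, q1) for q0, q1 in product((0.0, 1.0), repeat=2)
)
```

On this instance the two maxima agree, in line with the chain of equalities (42); a grid search of this kind does not scale and is meant only as a sanity check.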



4.3 An asymptotic solution to Markov decision problems with a stopping time horizon

It is known that, under the ergodicity and continuity assumptions we have made, $G$ defined
in (34) is the sample-path optimal value for the infinite horizon problem with cost $g$ not
only over the set of all stationary policies, but also over the larger set of all feasible policies
(actually over all admissible policies, see [4]). This means that, for every feasible policy
$\pi = (\pi_t)$,

$$\limsup_{n \in \mathbb N} \frac{1}{n} \sum_{t=1}^{n} g\left(S_t, \pi_t(S_1^t, Y_1^{t-1})\right) \le G , \qquad \mathbb P_\pi\text{-a.s.} \tag{43}$$



Moreover, it is a known fact that for an arbitrary sequence of policies $(\pi^n)$ we have

$$\limsup_{n \in \mathbb N} \frac{1}{n}\, \mathbb E_{\pi^n}\left[\sum_{t=1}^{n} g(S_t, U_t)\right] \le G , \tag{44}$$



i.e. the limit of the optimal values of finite horizon problems coincides with infinite horizon
optimal value. (44) can be proved by using dynamic programming arguments based on
Bellman's principle of optimality. As shown in [27], (44) is useful in characterizing the capacity
of channels with memory and feedback with fixed-length codes. Actually, a much more general
result than (33) can be proved, as explained in the sequel.

In the convex analytical approach, the key point in the proof of (43) consists in showing that,
under any (generally non stationary) feasible policy $\pi$, the empirical measure process $(\nu_n)$
as defined in (36) converges $\mathbb P_\pi$-almost surely to the set $K$. The way this is usually proven
is by using a martingale central limit theorem in order to show that the finite-dimensional
process $F(\nu_n)$ converges to $0$ almost surely. The following is a stronger result, providing an
exponential upper bound on the tails of the random sequence $(\|F(\nu_n)\|)$, this bound being
uniform with respect to the choice of the policy $\pi$.



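Lemma 8 For every $\varepsilon > 0$, and for every feasible policy $\pi$

$$\mathbb P_\pi\left(\|F(\nu_n)\| \ge \varepsilon + \frac{1}{n}\right) \le 2|\mathcal S| \exp\left(-n\varepsilon^2/2\right) . \tag{45}$$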



Proof See Appendix B. □
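To get a feeling for the right-hand side of (45), the sketch below evaluates $2|\mathcal S| \exp(-n\varepsilon^2/2)$ and inverts it to find the smallest horizon at which the bound drops below a target level. The function names `tail_bound` and `horizon_for` and the sample values are hypothetical; the formula itself is the one stated in Lemma 8.

```python
import math


def tail_bound(n, eps, n_states):
    """Right-hand side of (45), 2 |S| exp(-n eps^2 / 2), capped at 1."""
    return min(1.0, 2 * n_states * math.exp(-n * eps * eps / 2))


def horizon_for(eps, n_states, target):
    """Smallest n with 2 |S| exp(-n eps^2 / 2) <= target, for 0 < target < 1."""
    return math.ceil(2 * math.log(2 * n_states / target) / (eps * eps))


n_needed = horizon_for(eps=0.1, n_states=2, target=0.01)
```

The required horizon grows like $\varepsilon^{-2}$ and only logarithmically in $|\mathcal S|$ and in the inverse of the target level, and it does not depend on the policy.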
We emphasize the fact that the bound (45) is uniform with respect to the choice of the
feasible policy $\pi$. It is now possible to draw conclusions on the tails of the running average
cost $\frac{1}{n} \sum_{t=1}^{n} g(S_t, U_t)$ based on (45). The core idea is the following. By the definition of the
empirical measure we can rewrite the normalized running cost as

$$\frac{1}{n} \sum_{t=1}^{n} g(S_t, U_t) = \langle \nu_n, g \rangle .$$

Since the map $\eta \mapsto \langle \eta, g \rangle$ is continuous over $\mathcal P(\mathcal S \times \mathcal U)$, and $G = \max\{\langle \eta, g \rangle \mid \eta \in K\}$, we have
that, whenever $\nu_n$ is close to the set $K$, $\langle \nu_n, g \rangle$ cannot be much larger than $G$. It follows that,
if with high probability $\nu_n$ is close enough to $K$, then with high probability $\langle \nu_n, g \rangle$ cannot be
much larger than $G$. In order to show that with high probability $\nu_n$ is close to $K$, we want
to use (45). In fact, if for some $\eta$ in $\mathcal P(\mathcal S \times \mathcal U)$ the quantity $\|F(\eta)\|$ is very small, then $\eta$ is
necessarily close to $K$. More precisely, we define the function

$$\gamma : \mathbb R_+ \to \mathbb R , \qquad \gamma(x) := \sup\left\{\langle \eta, g \rangle \mid \eta \in \mathcal P(\mathcal S \times \mathcal U) : \|F(\eta)\| \le x\right\} .$$

Clearly $\gamma$ is nondecreasing and $\gamma(0) = G$. Moreover we have the following result.

Lemma 9 The map $\gamma$ is upper semicontinuous (i.e. $x_n \to x \Rightarrow \limsup_n \gamma(x_n) \le \gamma(x)$).

Proof See Appendix B. ■

We now introduce the random process $(G_n)$

$$G_n := \sup_{t \ge n} \langle \nu_t, g \rangle , \qquad n \in \mathbb N .$$

Clearly the process $(G_n)$ is samplepathwise non increasing in $n$.

Lemma 10 Let $(\tau_k)$ be a sequence of stopping times for the filtration $\mathcal F$ and $(\pi^k)$ be a
sequence of feasible policies such that $\mathbb E_{\pi^k}[\tau_k] < \infty$ for every $k$ and

$$\lim_{k \in \mathbb N} \mathbb P_{\pi^k}(\tau_k \le M) = 0 , \qquad \forall M \in \mathbb N . \tag{46}$$

Then

$$\lim_{k \in \mathbb N} \mathbb P_{\pi^k}\left(G_{\tau_k} > \gamma(\varepsilon)\right) = 0 , \qquad \forall \varepsilon > 0 . \tag{47}$$

Proof See Appendix B. □

The following result can be considered as an asymptotic estimate of (33). It consists in a
generalization of (44) from a deterministic increasing sequence of time horizons to a sequence
of stopping times satisfying the 'probabilistic divergence' requirement (46).



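Theorem 11 Let $(\tau_k)$ be a sequence of stopping times for the filtration $\mathcal F$ and $(\pi^k)$ be a
sequence of feasible policies such that $\mathbb E_{\pi^k}[\tau_k] < \infty$ for every $k$, and (46) holds true. Then

$$\limsup_{k \in \mathbb N} \frac{1}{\mathbb E_{\pi^k}[\tau_k]}\, \mathbb E_{\pi^k}\left[\sum_{t=1}^{\tau_k} g(S_t, U_t)\right] \le G . \tag{48}$$

Proof Let us fix an arbitrary $\varepsilon > 0$. By applying Lemma 10 we obtain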



i:giSt,Ut) 
t=i 



[Tk] 



E^k [Tk{Vrk,g)] 



E, 



Tk (Vr, , g) 1{G^^ <7(e)} J E„. [Tfc (t;,, , c) I > 7(e)] 



+ 



E^. [Tk] E^fe [Tk] 

<7(^^)+5maxIP^. (G^^ >7(e)) , 
where $g_{\max} := \max\{g(s, u) \mid s \in \mathcal S, u \in \mathcal U\}$. From (47) we get



.fc I G^^ > 7(e; 



lim sup — — - — r 

ken E^fe [Tk] 



E, 



t=i 



< 7(e) + 5max limsupP^fe {Gr^ > 7(e)) = 7(e) . 

fcgN 



Therefore (48) follows from the arbitrariness of $\varepsilon > 0$, and the fact that, as a consequence of
Lemma 9, we have

$$\lim_{\varepsilon \downarrow 0} \gamma(\varepsilon) = G .$$

4.4 Proof of Part 1 of Theorem 1

We are now ready to step back to the problem of upperbounding the error exponent of
variable-length block-coding schemes over FSMCs. We want to combine the result in Theorem
7 with that in Theorem 11 in order to finally prove Part 1 of Theorem 1.

Let $(\Phi^k, T_k, \Psi^k)$ be a sequence of variable-length block coding schemes satisfying (16).
Our goal is to prove that

$$\limsup_{k \in \mathbb N} \frac{-\log p_e(\Phi^k, T_k, \Psi^k)}{\mathbb E_{\Phi^k}[T_k]} \le D\left(1 - \frac{R}{C}\right) . \tag{49}$$

A first simple conclusion that can be drawn from Theorem [TJ using the crude bounds 
c(T$fci, St) <log\X[, d{St, T*^fc J < dmax , i = 0, 1 , 

is that 

-logpe D , , , , X6 D , ^ 

limsup < ^ log 1^1 + d^ax - Ril - 5) - log — + - R{6) < +00 . (50) 
Thus the error probability does not decay to zero faster than exponentially with the expected
transmission time $\mathbb E_{\Phi^k}[T_k]$.

The core idea to prove (49) consists in introducing a real sequence $(\delta_k)$ and showing that
both

$$\tau_k := \min\left\{T_k,\ \inf\{t \in \mathbb N : p_{MAP}(t) \le \delta_k\}\right\}$$

and $T_k - \tau_k$ diverge in the sense of satisfying (46). The sequence $(\delta_k)$ needs to be carefully
chosen: we want it to be asymptotically vanishing in order to guarantee that $\tau_k$ diverges, but
not too fast since otherwise $T_k - \tau_k$ would not diverge. It turns out that one possible good
choice is

$$\delta_k := \frac{-1}{\log p_e(\Phi^k, T_k, \Psi^k)} .$$

It is immediate to verify that $\lim_{k \in \mathbb N} p_e(\Phi^k, T_k, \Psi^k) = 0$ implies

$$\lim_{k \in \mathbb N} \delta_k = 0 , \qquad \lim_{k \in \mathbb N} \frac{p_e(\Phi^k, T_k, \Psi^k)}{\delta_k} = 0 . \tag{51}$$

Lemma 12 In the previous setting, for every fixed $M$ in $\mathbb N$, we have

$$\lim_{k \in \mathbb N} \mathbb P_{\Phi^k}(\tau_k \le M) = 0 , \qquad \lim_{k \in \mathbb N} \mathbb P_{\Phi^k}(T_k - \tau_k \le M) = 0 . \tag{52}$$



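Proof See Appendix B. □

Lemma 12 allows us to apply Theorem 11 first to the mutual information cost $c$ obtaining

$$\limsup_{k \in \mathbb N} \frac{C_{\delta_k}(\Phi^k, T_k)}{\mathbb E_{\Phi^k}[\tau_k]} = \limsup_{k \in \mathbb N} \frac{\mathbb E_{\Phi^k}\left[\sum_{t=1}^{\tau_k} c\left(S_t, \Upsilon^{\Phi^k}_t\right)\right]}{\mathbb E_{\Phi^k}[\tau_k]} \le C$$

and then to the information divergence cost $d$ obtaining

$$\limsup_{k \in \mathbb N} \frac{D_{\delta_k}(\Phi^k, T_k)}{\mathbb E_{\Phi^k}[T_k - \tau_k]} \le D .$$

Therefore, by applying Theorem 7 (with $\alpha_k$ and $\beta_k$ denoting the quantities $\alpha$ and $\beta$ evaluated at $\delta = \delta_k$) we get

$$\begin{aligned}
D &\ge \limsup_{k \in \mathbb N} \frac{1}{\mathbb E_{\Phi^k}[T_k]} \left(\frac{D}{C}\, C_{\delta_k}(\Phi^k, T_k) + D_{\delta_k}(\Phi^k, T_k)\right) \\
&\ge \limsup_{k \in \mathbb N} \left(\frac{-\log p_e(\Phi^k, T_k, \Psi^k)}{\mathbb E_{\Phi^k}[T_k]} + \frac{D}{C}\, \frac{\log|\mathcal W_k|}{\mathbb E_{\Phi^k}[T_k]}\, (1 - \alpha_k) + \frac{\beta_k}{\mathbb E_{\Phi^k}[T_k]}\right) \\
&= \frac{D}{C}\, R + \limsup_{k \in \mathbb N} \frac{-\log p_e(\Phi^k, T_k, \Psi^k)}{\mathbb E_{\Phi^k}[T_k]} ,
\end{aligned}$$

thus proving (17).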
Figure 3: One epoch in the generalized Yamamoto-Itoh scheme: a total length $n$ is divided
into two phases: a transmission one of length $\tilde n = \lceil \gamma n \rceil$ and a confirmation one of length
$\hat n = \lfloor (1 - \gamma) n \rfloor$.

5 An asymptotically optimal scheme 

In this section we propose and analyze a family of causal coding schemes with feedback
asymptotically achieving the Burnashev exponent $E_B(R)$, thus proving Part 2 of Theorem 1.

The scheme we propose can be viewed as a generalization of the one introduced by Ya-
mamoto and Itoh in [33] and consists of a sequence of epochs. Each epoch is made up of two
distinct fixed-length transmission phases, respectively named communication and confirma-
tion phase. In the communication phase the message to be sent is encoded in a block code
and transmitted over the channel. At the end of this phase the decoder makes a tentative
decision about the message sent based on the observation of the channel outputs and of the
state sequence. As perfect causal feedback is available at the encoder, the result of this de-
cision is known at the encoder. In the confirmation phase a binary acknowledgment message,
confirming the decoder's estimate if it is correct, or denying it when it is wrong, is sent by
the encoder through a fixed-length repetition code-function. The decoder performs a binary
hypothesis test in order to decide whether a deny or a confirmation message has been sent.
If an acknowledgment is detected the transmission halts, while if a deny is detected the system
restarts transmitting the same message with the same protocol. Again because of perfect
feedback availability at the encoder, there are no synchronization problems.

More precisely we design our scheme as follows. Given a design rate $R$ in $(0, C)$, let
us fix an arbitrary $\gamma$ in $(R/C, 1)$. For every $n$ in $\mathbb N$, consider a message set $\mathcal W_n$ of cardinality
$|\mathcal W_n| = \exp(\lfloor nR \rfloor)$ and two blocklengths $\tilde n$ and $\hat n$ respectively defined as $\tilde n := \lceil \gamma n \rceil$, $\hat n := n - \tilde n$.
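The bookkeeping of one epoch can be sketched as follows. The code computes the two blocklengths and the logarithm of the message-set size, then the rate actually used in the first phase, which must stay below capacity (hence the requirement $\gamma > R/C$). The function name `epoch_parameters` and the sample values are hypothetical, and the message-set size is taken to be $\exp(\lfloor nR \rfloor)$ as in the text above.

```python
import math


def epoch_parameters(n, rate, gamma):
    """Block lengths and log message-set size for one epoch of length n.

    Returns (transmission length, confirmation length, log |W_n| in nats).
    """
    n_tx = math.ceil(gamma * n)          # transmission phase length
    n_conf = n - n_tx                    # confirmation phase length
    log_messages = math.floor(n * rate)  # log |W_n|
    return n_tx, n_conf, log_messages


n_tx, n_conf, log_m = epoch_parameters(n=1000, rate=0.25, gamma=0.75)
first_phase_rate = log_m / n_tx  # close to rate / gamma
```

With these sample values the first phase runs at rate about $R/\gamma = 1/3$ nats per channel use, so the sketch is meaningful only for a channel whose capacity exceeds that value.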

Fixed-length block-coding for the transmission phase 

It is known from previous works (see for instance) that the capacity C of the stationary 
Markov channel we are considering is achievable by fixed-length block-coding schemes. Thus, 
since the rate of the first transmission phase is below capacity. 



R := lim 



log |W, 



n 




ngN 



n 



there exists a family of causal encoders (^>") parametrized by an index n in N 





)t-i 



X 
with error probability asymptotically vanishing in $n$. More precisely, since the state space
$\mathcal S$ is finite, the pair $(\tilde\Phi^n, \tilde\Psi^n)$ can be designed in such a way that the probability
$\mathbb P_{\tilde\Phi^n}(\tilde\Psi^n \ne W \mid W = w, S_1 = s)$ of error conditioned on the transmission of any message $w$ in $\mathcal W_n$ and of
an initial state $s$ approaches zero uniformly with respect both to $w$ and $s$, i.e.

$$p(n) := \max_{w \in \mathcal W_n} \max_{s \in \mathcal S} \mathbb P_{\tilde\Phi^n}\left(\tilde\Psi^n \ne W \mid W = w, S_1 = s\right) \xrightarrow[n \to \infty]{} 0 . \tag{53}$$

The triple $(\tilde\Phi^n, \tilde n, \tilde\Psi^n)$ will be used in the first phase of each epoch of our iterative transmission
scheme. □

Binary hypothesis test for the confirmation phase 

For the second phase, instead, we consider a causal binary input encoder based on the
optimal stationary policies in the maximization problem (13). More specifically, we define
$\hat\Phi^n$ by

$$\hat\phi^n_t : \{0, 1\} \times \mathcal S^t \to \mathcal X , \qquad \hat\phi^n_t(m, s_1^t) = f_m(s_t) , \qquad m = 0, 1 , \quad t = 1, \ldots, \hat n ,$$

where $f_0, f_1 : \mathcal S \to \mathcal X$ are such that

$$D = \sum_{s \in \mathcal S} \mu_{f_0}(s)\, D\left(P(\cdot, \cdot \mid s, f_0(s)) \,\|\, P(\cdot, \cdot \mid s, f_1(s))\right) .$$

Suppose that an acknowledgment message $m = 0$ is sent. Then it is easy to verify that the
pair sequence $(S_{t+1}, Y_t)_t$ forms a Markov chain over the space of the achievable channel
state output pairs

$$\mathcal Z := \left\{(s_+, y) \in \mathcal S \times \mathcal Y \ \text{s.t.}\ \exists s \in \mathcal S, \exists x \in \mathcal X : P(s_+, y \mid s, x) > 0\right\} \tag{54}$$

with transition probability matrix

$$P_0 = \left(P_0(s_+, y \mid s, y_-) := P(s_+, y \mid s, f_0(s))\right) .$$

Analogously, if a deny message $m = 1$ has been sent, then $(S_{t+1}, Y_t)_t$ forms a Markov chain
with transition probability matrix

$$P_1 = \left(P_1(s_+, y \mid s, y_-) := P(s_+, y \mid s, f_1(s))\right) .$$

It follows that a decoder for $\hat\Phi^n$ is a binary hypothesis test between two Markov chain hypothe-
ses. Notice that for both chains the transition probabilities $P_0(s_+, y \mid s, y_-)$ and $P_1(s_+, y \mid s, y_-)$
do not depend on the second component $y_-$ of the past state, but only on
its first component $s$ as well as on the full future state $(s_+, y)$.

When the coefficient $D$ is finite, as a consequence of Assumption 2 and (11), we have
that the stochastic matrices $P_0$ and $P_1$ are both irreducible over $\mathcal Z$, with the invariant
measure of $P_i$ given by

$$\hat\mu_i \in \mathcal P(\mathcal Z) , \qquad \hat\mu_i(s_+, y) := \sum_{s \in \mathcal S} \mu_{f_i}(s)\, P(s_+, y \mid s, f_i(s)) , \qquad i = 0, 1 .$$

Using binary hypothesis test results for irreducible Markov chains (see [2D] and [H pagg. 72-82])
it is possible to show that a decoder

$$\hat\Psi^n : (\mathcal S \times \mathcal Y)^{\hat n - 1} \to \{0, 1\}$$

can be chosen in such a way that, asymptotically in $n$, its type 1 error probability achieves
the exponent (recall (13))

$$\sum_{z, z_+ \in \mathcal Z} \hat\mu_0(z)\, P_0(z_+ \mid z) \log \frac{P_0(z_+ \mid z)}{P_1(z_+ \mid z)} = \sum_{s, s_+, y} \mu_{f_0}(s)\, P(s_+, y \mid s, f_0(s)) \log \frac{P(s_+, y \mid s, f_0(s))}{P(s_+, y \mid s, f_1(s))} = \sum_{s \in \mathcal S} \mu_{f_0}(s)\, D\left(P(\cdot, \cdot \mid s, f_0(s)) \,\|\, P(\cdot, \cdot \mid s, f_1(s))\right) = D$$

while its type 0 error probability is vanishing. More specifically, since the state space is
finite, we have that, defining $p_i(n)$ as the maximum over all possible initial states of the error
probability of the pair $(\hat\Phi^n, \hat\Psi^n)$ conditioned on the transmission of a '$i$' confirmation message,
i.e.

$$p_i(n) := \max_{s \in \mathcal S} \mathbb P_{\hat\Phi^n}\left(\hat\Psi^n \ne i \mid W = i, S_1 = s\right) , \qquad i = 0, 1 ,$$

we have

$$\lim_{n \in \mathbb N} p_0(n) = 0 , \qquad \lim_{n \in \mathbb N} -\frac{1}{\hat n} \log p_1(n) = D . \tag{55}$$
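The exponent appearing above is a divergence rate between two Markov chains: the stationary average, under the first chain, of the Kullback-Leibler divergence between corresponding rows of the two transition matrices. The sketch below computes it for small stochastic matrices. It is an illustration only: `stationary` uses a lazy power iteration chosen here for simplicity, the first matrix is assumed irreducible, and the names and numbers are hypothetical.

```python
import math


def stationary(P, iters=5000):
    """Invariant measure of an irreducible stochastic matrix by lazy power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
        # Averaging with the previous iterate keeps the fixed point and avoids periodicity.
        mu = [0.5 * (m + s) for m, s in zip(mu, step)]
    return mu


def divergence_rate(P0, P1):
    """sum_z mu0(z) sum_z' P0(z'|z) log(P0(z'|z) / P1(z'|z)).

    Returns math.inf when P1 forbids a transition that P0 allows.
    """
    mu0 = stationary(P0)
    total = 0.0
    for z, row in enumerate(P0):
        for z2, p in enumerate(row):
            if p == 0.0:
                continue
            if P1[z][z2] == 0.0:
                return math.inf
            total += mu0[z] * p * math.log(p / P1[z][z2])
    return total


P0 = [[0.9, 0.1], [0.2, 0.8]]
P1 = [[0.5, 0.5], [0.6, 0.4]]
rate = divergence_rate(P0, P1)
```

The infinite return value corresponds to the situation, treated next in the text, in which some transition is possible under one hypothesis and impossible under the other.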

When the coefficient $D$ is infinite, then the stochastic matrix $P_0$ is irreducible over $\mathcal Z$,
while there exists at least a pair $z_-, z_+$ in $\mathcal Z$ such that $P_0(z_+ \mid z_-) > 0$ while $P_1(z_+ \mid z_-) = 0$. It
follows that a sequence of binary tests $(\hat\Psi^n)$, with $\hat\Psi^n : (\mathcal S \times \mathcal Y)^{\hat n - 1} \to \{0, 1\}$, can be designed
such that

$$\lim_{n \in \mathbb N} p_0(n) = 0 , \qquad p_1(n) = 0 , \quad n \in \mathbb N . \tag{56}$$

Such a family of tests is given for instance by allowing $\hat\Psi^n(z)$ to equal $0$ if and only if the
$(\hat n - 1)$-tuple $z$ contains a symbol $z_-$ followed by a $z_+$.

□ 

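Once fixed $\tilde\Phi^n$, $\tilde\Psi^n$, $\hat\Phi^n$ and $\hat\Psi^n$, the iterative protocol described above defines a variable-
length block-coding scheme $(\Phi^n, T_n, \Psi^n)$. As mentioned above the scheme consists of a se-
quence of epochs, each of fixed length $n$; in particular we have

$$T_n = n\, \tau_n ,$$

where $\tau_n$, the first $k \in \mathbb N$ such that the confirmation-phase test $\hat\Psi^n$ of the $k$-th epoch returns $0$,
is a positive integer valued random variable describing the number of epochs occurred until
transmission halts.

The following result characterizes the asymptotic performances of the family of schemes
$(\Phi^n, T_n, \Psi^n)$.

Proposition 13 For every design rate $R$ in $(0, C)$, and $\gamma$ in $(R/C, 1)$, we have

$$\lim_{n \in \mathbb N} \frac{\log|\mathcal W_n|}{\mathbb E_{\Phi^n}[T_n]} = R \tag{57}$$

and

• if $D < +\infty$

$$\lim_{n \in \mathbb N} \frac{-\log p_e(\Phi^n, T_n, \Psi^n)}{\mathbb E_{\Phi^n}[T_n]} = D(1 - \gamma) , \tag{58}$$

• if $D = +\infty$

$$p_e(\Phi^n, T_n, \Psi^n) = 0 , \qquad n \in \mathbb N . \tag{59}$$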
Proof We introduce the following notation. First, for every $k \in \mathbb N$:

• $\tilde e_k$, the event that the tentative decision produced by $\tilde\Psi^n$ in the $k$-th epoch differs from $W$, is the error event of the first transmission phase
of the $k$-th epoch;

• $\hat e_k$, the event that the decision produced by $\hat\Psi^n$ in the $k$-th epoch differs from $1_{\tilde e_k}$, is the error event of the second transmission
phase of the $k$-th epoch.

Clearly we have

$$\mathbb P_{\Phi^n}\left(\tilde e_k \mid \mathcal F_{(k-1)n}\right) \le p(n) , \qquad \mathbb P_{\Phi^n}\left(\hat e_k \mid \mathcal F_{(k-1)n + \tilde n}\right) \le p_{1_{\tilde e_k}}(n) .$$

The transmission halts the first time a confirmation is detected at the end of the second phase,
i.e. the first time either a correct transmission in the first phase is followed by a successful
transmission of an acknowledgment message in the second phase, or an incorrect transmission
in the first phase is followed by a misreceived transmission of a deny message in the second
phase. It follows that we can rewrite $\tau_n$ as

$$\tau_n = \inf\left\{k \in \mathbb N \ \text{s.t.}\ (\tilde e_k \cap \hat e_k) \cup (\tilde e_k^c \cap \hat e_k^c)\right\} .$$

We claim that

$$\mathbb P_{\Phi^n}(\tau_n \ge k) \le \left(p(n) + p_0(n)\right)^{k-1} . \tag{60}$$

Indeed (60) can be shown by induction. It is clearly true for $k = 1$. Suppose it is true for
some $k$ in $\mathbb N$; then

P$n(r„ > A; + 1) = ¥^n{Tn>k + l|r„ > A;)P$n(r„ > k) 

= (P$n(efc+i) (1 -P$n(efc+i|efc)) + (1 - P$n(efc+i)) Pci,n(efc+i| (efc)^))P^,n(T„ > A;) 

< {p{n)+po{n))F^n{Tn>k) 

< {p{n) +Po{n))'' . 

Thus $\tau_n$ is stochastically dominated by the sum of a constant 1 plus a r.v. with geometric
distribution of parameter $p(n) + p_0(n)$. It follows that its expected value can be bounded as
follows

$$1 \le \mathbb E_{\Phi^n}[\tau_n] = \sum_{k \ge 1} \mathbb P_{\Phi^n}(\tau_n \ge k) \le \sum_{k \ge 1} \left(p(n) + p_0(n)\right)^{k-1} = \frac{1}{1 - p(n) - p_0(n)} .$$
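The geometric domination argument is easy to check numerically. The sketch below evaluates the tail bound (60) and the resulting bound on the expected number of epochs obtained by summing the tails. The names `tail_bound_epochs` and `expected_epochs_bound` and the sample error probabilities are hypothetical.

```python
def tail_bound_epochs(p, p0, k):
    """Bound (60): P(number of epochs >= k) <= (p + p0)^(k - 1), for k >= 1."""
    return min(1.0, (p + p0) ** (k - 1))


def expected_epochs_bound(p, p0):
    """Sum of the tail bounds over k >= 1, equal to 1 / (1 - p - p0) when p + p0 < 1."""
    if p + p0 >= 1.0:
        raise ValueError("the bound requires p + p0 < 1")
    return 1.0 / (1.0 - p - p0)


bound = expected_epochs_bound(0.01, 0.04)
partial_sum = sum(tail_bound_epochs(0.01, 0.04, k) for k in range(1, 200))
```

As both error probabilities vanish the bound tends to 1, which is the mechanism by which the expected number of epochs converges to one.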



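Hence, from (53) and (55) we have

\[ \lim_{n\in\mathbb N} \mathbb E_{\Phi^n}[\tau_n] = 1 . \tag{61} \]

From (61) it immediately follows that

\[ \lim_{n\in\mathbb N} \frac{\log|\mathcal W_n|}{\mathbb E_{\Phi^n}[T_n]} = \lim_{n\in\mathbb N} \frac{\log|\mathcal W_n|}{n\, \mathbb E_{\Phi^n}[\tau_n]} = R . \]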
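Moreover, transmission ends with an error if and only if an error happens in the first transmission 
phase followed by a type-1 error in the second phase, so that the error probability 
of the overall scheme $(\Phi^n, T_n, \Psi^n)$ can be bounded as follows:

\begin{align*}
p_e(\Phi^n, T_n, \Psi^n) &= \sum_{t \ge 1} \mathbb P_{\Phi^n}\left(e_t \cap \hat e_t \cap \{\tau_n = t\}\right) \\
&= \sum_{t \ge 1} \mathbb P_{\Phi^n}\left(e_t \cap \hat e_t \cap \{\tau_n \ge t\}\right) \\
&= \sum_{t \ge 1} \mathbb P_{\Phi^n}\left(e_t \cap \hat e_t \mid \{\tau_n \ge t\}\right) \mathbb P_{\Phi^n}(\tau_n \ge t) \\
&\le \sum_{t \ge 1} p(n)\, p_1(n)\, \mathbb P_{\Phi^n}(\tau_n \ge t) \\
&\le \frac{p(n)\, p_1(n)}{1 - p(n) - p_0(n)} . \tag{62}
\end{align*}
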
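The last two steps combine the per-epoch bound $p(n)p_1(n)$ with the geometric tail bound on the number of epochs. A minimal numerical sketch of that combination follows, under the assumption that the bound reads $p(n)p_1(n)/(1 - p(n) - p_0(n))$; the function names and the numerical values are hypothetical.

```python
import math


def error_probability_bound(p, p0, p1):
    """Closed-form bound p * p1 / (1 - p - p0) on the overall error probability."""
    if not 0 <= p + p0 < 1:
        raise ValueError("need 0 <= p + p0 < 1")
    return p * p1 / (1.0 - p - p0)


def error_probability_series(p, p0, p1, terms=10_000):
    """Same bound obtained by summing p * p1 * (p + p0)**(t - 1) over t >= 1."""
    q = p + p0
    return sum(p * p1 * q ** (t - 1) for t in range(1, terms + 1))


# Illustrative values only (not taken from the paper).
closed = error_probability_bound(0.05, 0.05, 1e-6)
series = error_probability_series(0.05, 0.05, 1e-6)
print(closed, series)
```

Note that when `p1` is zero the bound vanishes, which is the situation corresponding to an infinite Burnashev coefficient.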

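When $D$ is infinite, (62) directly implies (59). When $D$ is finite, from (53), (55), (61) and 
(62) it follows that

\begin{align*}
\liminf_{n\in\mathbb N} \frac{-\log p_e(\Phi^n, T_n, \Psi^n)}{\mathbb E_{\Phi^n}[T_n]} &= \liminf_{n\in\mathbb N} \frac{-\log p_e(\Phi^n, T_n, \Psi^n)}{n\, \mathbb E_{\Phi^n}[\tau_n]} \\
&\ge \liminf_{n\in\mathbb N} \left(1 - p(n) - p_0(n)\right) \left( \frac{-\log\left(p(n) p_1(n)\right)}{n} + \frac{1}{n} \log\left(1 - p(n) - p_0(n)\right) \right) \\
&= D(1-\gamma) ,
\end{align*}

which proves (58).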
neN \ n n 



It is clear that (18) follows from (58) and the arbitrariness of $\gamma$ in $(R/C, 1)$, so that Part 2 
of Theorem 1 is proved. 

We end this section with the following observation. It follows from (60) that the probability 
that the proposed transmission scheme halts after more than one epoch is bounded 
by $p(n) + p_0(n)$, a term which vanishes asymptotically in $n$. Then, even if the transmission 
time is variable, it is constant with high probability. As also observed in [17] in the 
memoryless case, this is a desirable property from a practical point of view. 

6 An example 

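We consider an FSMC as in Fig. 4, with state space $\mathcal S = \{G, B\}$, input and output spaces 
$\mathcal X = \mathcal Y = \{0, 1\}$, and stochastic kernel given by:

\[ P(s_+, y \mid s, x) = P_S(s_+ \mid s, x)\, P_Y(y \mid s, x) , \qquad s, s_+ \in \mathcal S , \quad x, y \in \{0, 1\} , \]

\[ P_S(B \mid G, 0) = \alpha_0 , \quad P_S(B \mid G, 1) = \alpha_1 , \quad P_S(G \mid B, 0) = \beta_0 , \quad P_S(G \mid B, 1) = \beta_1 , \]

\[ P_Y(1 \mid G, 0) = P_Y(0 \mid G, 1) = p_G , \qquad P_Y(1 \mid B, 0) = P_Y(0 \mid B, 1) = p_B , \]

where $0 < p_G < p_B < 1/2$ and $\alpha_0, \alpha_1, \beta_0, \beta_1 \in (0, 1)$. For any stationary policy $\pi : \mathcal S \to \mathcal P(\{0, 1\})$, 
the state invariant measure $\mu_\pi$ associated to $\pi$ can be made explicit:

\[ \mu_\pi(B) = \frac{\alpha_0 [\pi(G)](0) + \alpha_1 [\pi(G)](1)}{\alpha_0 [\pi(G)](0) + \alpha_1 [\pi(G)](1) + \beta_0 [\pi(B)](0) + \beta_1 [\pi(B)](1)} = 1 - \mu_\pi(G) . \]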
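The invariant measure of a two-state chain with $G \to B$ probability $a$ and $B \to G$ probability $b$ puts mass $a/(a+b)$ on $B$ and $b/(a+b)$ on $G$. The following sketch evaluates it for a stationary policy and checks stationarity numerically. The function name, the dictionary layout and the particular policy are assumptions made for illustration; the transition parameters are those quoted for Fig. 5 with $\gamma = 0.4$.

```python
def invariant_measure(policy, alpha, beta):
    """Stationary distribution of the state chain on {G, B}.

    policy[s][x] is the probability of input x in state s;
    alpha[x] = P_S(B | G, x) and beta[x] = P_S(G | B, x).
    """
    g_to_b = sum(policy["G"][x] * alpha[x] for x in (0, 1))
    b_to_g = sum(policy["B"][x] * beta[x] for x in (0, 1))
    total = g_to_b + b_to_g
    return {"G": b_to_g / total, "B": g_to_b / total}, g_to_b, b_to_g


# alpha_0 = 1 - beta_0 = 0.7 and alpha_1 = 1 - beta_1 = gamma with gamma = 0.4.
alpha = {0: 0.7, 1: 0.4}
beta = {0: 0.3, 1: 0.6}
# Hypothetical stationary policy, chosen only for illustration.
policy = {"G": {0: 0.5, 1: 0.5}, "B": {0: 0.25, 1: 0.75}}

mu, a, b = invariant_measure(policy, alpha, beta)
# One step of the chain must leave the measure unchanged.
mu_next_G = mu["G"] * (1 - a) + mu["B"] * b
print(mu, mu_next_G)
```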
Figure 4: A simple FSMC with binary state space $\mathcal S = \{G, B\}$ and binary input/output 
space $\mathcal X = \mathcal Y = \{0, 1\}$: notice that the state transition probabilities are allowed to depend on 
the current input (ISI).




Figure 5: In the left-hand picture, the capacity of the FSMC in Fig. 4, for values of the 
parameters $p_G = 0.001$, $p_B = 0.1$, $\alpha_0 = 1 - \beta_0 = 0.7$, $\alpha_1 = 1 - \beta_1 = \gamma$, is plotted as a function 
of $\gamma$ in $(0, 1)$. In the right-hand picture, for the same values of the parameters, the optimal 
policy $\pi^* : \{G, B\} \to \mathcal P(\{0, 1\})$ is plotted as a function of $\gamma$ in $(0, 1)$. 

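The mutual information costs are given by 

\[ c(G, u) = \mathrm H\left(u(1)\alpha_1 + u(0)\alpha_0\right) + \mathrm H\left(u(1) p_G + u(0)(1 - p_G)\right) - \mathrm H(p_G) - \left(u(1) \mathrm H(\alpha_1) + u(0) \mathrm H(\alpha_0)\right) , \]

\[ c(B, u) = \mathrm H\left(u(1)\beta_1 + u(0)\beta_0\right) + \mathrm H\left(u(1) p_B + u(0)(1 - p_B)\right) - \mathrm H(p_B) - \left(u(1) \mathrm H(\beta_1) + u(0) \mathrm H(\beta_0)\right) , \]

$\mathrm H$ denoting the binary entropy function. The information divergence costs instead are given 
by 

\[ d(G, \delta_{f_0(G)}) = D(p_G \| 1 - p_G) + D(\alpha_{f_0(G)} \| \alpha_{f_1(G)}) , \]

\[ d(B, \delta_{f_0(B)}) = D(p_B \| 1 - p_B) + D(\beta_{f_0(B)} \| \beta_{f_1(B)}) , \]

where, for $x, y$ in $[0, 1]$, $D(x \| y) := x \log \frac{x}{y} + (1 - x) \log \frac{1 - x}{1 - y}$.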

In Fig. 5 and Fig. 6 the special case when $p_G = 0.001$, $p_B = 0.1$, $\alpha_0 = 1 - \beta_0 = 0.7$ and 
$\alpha_1 = 1 - \beta_1 = \gamma$ is studied as a function of the parameter $\gamma$ in $(0, 1)$. In particular, in Fig. 5 
the capacity and the optimal policy $\pi^* : \mathcal S \to \mathcal P(\mathcal X)$ are plotted as a function of $\gamma$. Notice that 
for $\gamma = 0.7$ the channel has no ISI and actually coincides with a memoryless Gilbert-Elliott 
channel: for that value the optimal policy chooses the uniform distribution both in the good 
state $G$ and in the bad state $B$. For values of $\gamma$ below 0.7 (resp. above 0.7), instead, 
the optimal policy puts more mass on the input symbol 1 (resp. the symbol 0) both in state 
$G$ and in state $B$, and it is more unbalanced in state $B$. In Fig. 6 the Burnashev coefficient 
of the channel is plotted as a function of the parameter $\gamma$, as well as the values of the 
ergodic Kullback-Leibler cost corresponding to the four possible policies $f_0 : \{G, B\} \to \{0, 1\}$. 
Observe that the minimum value of $D$ is achieved for $\gamma = 0.7$; in that case all four 
nontrivial policies $f_0$, $f_1$ give the same value of the Kullback-Leibler cost. 

Figure 6: The thick solid line is a plot of the Burnashev coefficient $D$ (evaluated with natural 
log base) of the FSMC of Fig. 4 for the same values of the parameters as in Fig. 5.
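The coincidence of the four costs at $\gamma = 0.7$ can be checked numerically. The sketch below is an illustration under explicit assumptions, not the computation used for the figures: the ergodic Kullback-Leibler cost of a deterministic policy $f_0$ is taken to be the invariant-measure average of the per-state divergence costs, and the competing input in each state is taken to be the complementary symbol. All function names are hypothetical.

```python
import math
from itertools import product


def kl(x, y):
    """Binary divergence D(x || y) in nats, for x, y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def ergodic_kl_cost(f0, gamma, p_g=0.001, p_b=0.1):
    """Ergodic Kullback-Leibler cost of a deterministic policy f0.

    f0 maps each state to an input in {0, 1}. Assumption: the competing
    input in each state is the complementary symbol 1 - f0(s).
    """
    alpha = {0: 0.7, 1: gamma}        # P_S(B | G, x)
    beta = {0: 0.3, 1: 1.0 - gamma}   # P_S(G | B, x)
    x_g, x_b = f0["G"], f0["B"]
    d_g = kl(p_g, 1 - p_g) + kl(alpha[x_g], alpha[1 - x_g])
    d_b = kl(p_b, 1 - p_b) + kl(beta[x_b], beta[1 - x_b])
    # Invariant measure of the state chain driven by f0.
    g_to_b, b_to_g = alpha[x_g], beta[x_b]
    mu_g = b_to_g / (g_to_b + b_to_g)
    return mu_g * d_g + (1 - mu_g) * d_b


policies = [{"G": xg, "B": xb} for xg, xb in product((0, 1), repeat=2)]
costs_at_07 = [ergodic_kl_cost(f, 0.7) for f in policies]
costs_at_04 = [ergodic_kl_cost(f, 0.4) for f in policies]
print(costs_at_07)
print(costs_at_04)
```

At $\gamma = 0.7$ the state transitions do not depend on the input, so the transition-divergence terms vanish and the invariant measure is the same for every policy; away from that value the four costs separate.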

Finally, it is worth considering the simple non-ISI case when $\alpha_0 = \alpha_1 = \beta_0 = \beta_1$. In this 
case the state ergodic measure is the uniform one on $\{G, B\}$. By a basic convexity 
argument, the capacity $C$ and the Burnashev coefficient $D$ of the channel satisfy 

\[ C = 1 - \tfrac12 \mathrm H(p_G) - \tfrac12 \mathrm H(p_B) > 1 - \mathrm H\left(\tfrac12 p_G + \tfrac12 p_B\right) =: \tilde C , \tag{63} \]

\[ D = \tfrac12 D(p_G \| 1 - p_G) + \tfrac12 D(p_B \| 1 - p_B) > D\left(\tfrac12 p_G + \tfrac12 p_B \,\middle\|\, 1 - \tfrac12 p_B - \tfrac12 p_G\right) =: \tilde D . \tag{64} \]

In (63) and (64), $\tilde C$ and $\tilde D$ correspond respectively to the capacity and the Burnashev 
coefficient of a memoryless binary symmetric channel with crossover probability equal to the 
ergodic average of the crossover probabilities $p_B$ and $p_G$. Such a channel is introduced in 
practice when channel interleavers are used in order to apply to FSMCs coding techniques 
designed for DMCs. While this approach reduces the decoding complexity, it is well known 
that it reduces the achievable capacity (63) (see [13]). Inequality (64) shows that this approach 
also causes a loss in the Burnashev coefficient of the channel. 
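Both inequalities can be verified numerically for the crossover probabilities used in the figures. The following sketch is only an illustration; the function and variable names are hypothetical, capacities are computed in bits and divergences in nats.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def d_flip(p):
    """Binary divergence D(p || 1 - p) in nats."""
    return p * math.log(p / (1 - p)) + (1 - p) * math.log((1 - p) / p)


p_g, p_b = 0.001, 0.1          # crossover probabilities in the two states
p_avg = 0.5 * p_g + 0.5 * p_b  # ergodic average seen after interleaving

# State-aware quantities (uniform ergodic state measure).
capacity = 1 - 0.5 * h2(p_g) - 0.5 * h2(p_b)
burnashev = 0.5 * d_flip(p_g) + 0.5 * d_flip(p_b)

# Quantities of the averaged binary symmetric channel.
capacity_interleaved = 1 - h2(p_avg)
burnashev_interleaved = d_flip(p_avg)

print(capacity, capacity_interleaved)
print(burnashev, burnashev_interleaved)
```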

7 Conclusions 

In this paper we studied the error exponent of FSMCs with feedback. We have proved an 
exact single-letter characterization of the reliability function for variable-length block-coding 
schemes with perfect causal output feedback, generalizing the result obtained by Burnashev 
[5] for memoryless channels. Our assumptions are that the channel state is causally observable 
at both the encoder and the decoder, and that the stochastic kernel describing the channel satisfies 
some mild ergodicity properties. 
As a first topic for future research, we would like to extend our result to the case when 
the state is either observable at the encoder only or not observable at either side. We 
believe that the techniques used in [27] in order to characterize the capacity of FSMCs with 
state not observable may be adopted to handle our problem as well. The main idea consists 
in studying a partially observable Markov decision process and reducing it to a fully observable 
one with a larger state space. However, some technical issues may appear since, in order to 
deal with average cost problems, we used finiteness of the state space in our proofs in Section 
4. Finally, it would be interesting to consider the problem of finding universal schemes which 
do not require exact knowledge of the channel statistics but use feedback in order to estimate 
them. 

A Proofs for Section 3 

For the reader's convenience all statements are repeated before their proof. 
Lemma 2 Given any causal feedback encoder $\Phi$, we have, for every $t$ in $\mathbb N$, 

\[ P_{MAP}(t) \ge \lambda\, P_{MAP}(t-1) \qquad \mathbb P_\Phi\text{-a.s.} \]
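Proof A first observation is that 

\[ \mathbb P_\Phi\left( \bigcap_{x\in\mathcal X} \left\{ P(S_{t+1}, Y_t \mid S_t, x) = 0 \right\} \right) = 0 , \qquad \forall\, t \in \mathbb N . \]

It follows that, $\mathbb P_\Phi$-almost surely, for every $t$ in $\mathbb N$, 

\[ P(S_{t+1}, Y_t \mid S_t, X_t) \ge \lambda_{S_t} \ge \lambda . \]

Let us fix an arbitrary message $w$ in $\mathcal W$. We have 

\begin{align*}
\mathbb P_\Phi(W = w \mid \mathcal G_t) &\ge \mathbb P_\Phi(W = w \mid \mathcal G_t)\, \mathbb P_\Phi(S_{t+1}, Y_t \mid \mathcal G_{t-1}) \\
&= \mathbb P_\Phi(W = w \mid \mathcal G_{t-1})\, \mathbb P_\Phi(S_{t+1}, Y_t \mid W = w, \mathcal G_{t-1}) \\
&= \mathbb P_\Phi(W = w \mid \mathcal G_{t-1})\, P(S_{t+1}, Y_t \mid S_t, X_t) \\
&\ge \lambda\, \mathbb P_\Phi(W = w \mid \mathcal G_{t-1}) .
\end{align*}

Denoting by $\hat w_t$ a maximizer of $w \mapsto \mathbb P_\Phi(W = w \mid \mathcal G_t)$, it follows that 

\begin{align*}
P_{MAP}(t) &= \sum_{w \ne \hat w_t} \mathbb P_\Phi(W = w \mid \mathcal G_t) \\
&\ge \sum_{w \ne \hat w_t} \lambda\, \mathbb P_\Phi(W = w \mid \mathcal G_{t-1}) \\
&\ge \lambda\, P_{MAP}(t-1) .
\end{align*}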
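The mechanism of the proof is that a single Bayes update cannot shrink any posterior by more than a factor $\lambda$, because every likelihood is at least $\lambda$ while the normalizing constant is at most one. A minimal numerical sketch follows; the function names and all numerical values, including the value of `lam`, are hypothetical and serve only to illustrate the inequality.

```python
def bayes_update(prior, likelihood):
    """One posterior update; likelihood[w] plays the role of P(S_{t+1}, Y_t | S_t, x_w)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    norm = sum(joint)  # at most 1 because every likelihood is at most 1
    return [j / norm for j in joint]


def map_error(posterior):
    """A posteriori error probability of the MAP estimate."""
    return 1.0 - max(posterior)


lam = 0.1                      # hypothetical lower bound on the positive kernel entries
prior = [0.85, 0.10, 0.05]     # posterior at time t - 1 (illustrative)
likelihood = [0.1, 0.6, 0.3]   # each entry lies in [lam, 1]

posterior = bayes_update(prior, likelihood)
print(map_error(prior), map_error(posterior))
```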



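Lemma 3 For any variable-length block-coding scheme $(\Phi, T, \Psi)$ and any $0 < \delta < \frac12$, we 
have 

\[ C_\delta(\Phi, T) \ge \left(1 - \delta - \frac{p_e(\Phi, T, \Psi)}{\delta}\right) \log|\mathcal W| - \mathrm H(\delta) . \]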
Proof For every $n$ we introduce a random variable $\Gamma_n$ describing the conditional message 
entropy given the information $\mathcal G_n$ available at the encoder at time $n + 1$. Consider the real-valued 
random variable $V_n$ defined by 

\[ V_n := \Gamma_n + \sum_{t=1}^{n} c(S_t, \Upsilon_{\Phi,t}) , \qquad n \in \mathbb Z_+ . \]

We claim that $(V_n, \mathcal G_n)_{n\in\mathbb Z_+}$ is a submartingale. Indeed, for every $n$ in $\mathbb Z_+$, $V_n$ is $\mathcal G_n$-measurable, 
since $\Gamma_n$ is, and so are both $S_t$ and $\Upsilon_{\Phi,t}$ for every $1 \le t \le n$. Moreover we 
have 



E$ [r. 



n-l 



Tn Gn-1 



= E$ 
= E$ 



^^^ F^ {Sn+l,Yn\W,gn-l) ^^ 

IP$ {Sn+l,Yn\ X„, Qn~l) i ^ 
IP* (5'n+l, ^1 ^n-l) 



IZIZ '^■^■,n(a;)-P(s+,j/|g,a;)log -y^^'^/x ' ^'^ , r 



C (Tf<I>,n) 5*71) 5 



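the inequality in the formula above following from the data processing inequality once noted 
that, because of the causality of the encoder and the Markovian structure of the channel, 
$W - (X_n, \mathcal G_{n-1}) - (S_{n+1}, Y_n)$ forms a Markov chain. It follows that 

\[ \mathbb E_\Phi\left[V_n - V_{n-1} \mid \mathcal G_{n-1}\right] = \mathbb E_\Phi\left[\Gamma_n - \Gamma_{n-1} + c(S_n, \Upsilon_{\Phi,n}) \mid \mathcal G_{n-1}\right] \ge 0 . \]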

Moreover, $(V_n)$ has uniformly bounded increments since 

\[ |V_n - V_{n-1}| \le |c(S_n, \Upsilon_{\Phi,n})| + |\Gamma_n - \Gamma_{n-1}| \le \log|\mathcal X| + 2 \log|\mathcal W| < +\infty . \]

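Doob's optional sampling theorem can thus be applied to the submartingale $(V_n, \mathcal G_n)_{n\in\mathbb Z_+}$ 
and the stopping time $\tau_\delta$, concluding that 

\[ \log|\mathcal W| = \mathbb E_\Phi[\Gamma_0 \mid \mathcal G_0] = \mathbb E_\Phi[V_0 \mid \mathcal G_0] \le \mathbb E_\Phi[V_{\tau_\delta}] = \mathbb E_\Phi[\Gamma_{\tau_\delta}] + \mathbb E_\Phi\left[\sum_{t=1}^{\tau_\delta} c(S_t, \Upsilon_{\Phi,t})\right] . \tag{65} \]

Finally, combining (65) with (22), we obtain 

\[ C_\delta(\Phi, T) = \mathbb E_\Phi\left[\sum_{t=1}^{\tau_\delta} c(S_t, \Upsilon_{\Phi,t})\right] \ge \log|\mathcal W| - \mathbb E_\Phi[\Gamma_{\tau_\delta}] \ge \left(1 - \delta - \frac{p_e(\Phi, T, \Psi)}{\delta}\right) \log|\mathcal W| - \mathrm H(\delta) . \]
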



Lemma 4 Let $\tau$ and $T$ be stopping times for the filtration $\mathcal G$ such that $\tau \le T$, and consider 
a partition of the message set as in (26). Then 



U < E, 



.t=T+l 



■"^ — a.s. , i = 0,1 
Proof We will prove the claim for $i = 0$. Define, for $t \ge 1$, 

\[ Z_t := \log \frac{\sum_{x\in\mathcal X} P(S_{t+1}, Y_t \mid S_t, x)\, \Upsilon^0_t(x)}{\sum_{x\in\mathcal X} P(S_{t+1}, Y_t \mid S_t, x)\, \Upsilon^1_t(x)} , \]

with the agreement $\log \frac00 = 0$. We have that 

\[ |Z_t| \le 2 \log \frac{1}{\lambda} . \tag{66} \]




Indeed, if $P(S_{t+1}, Y_t \mid S_t, x) = 0$ for every $x$ in $\mathcal X$, then $Z_t = 0$ by definition. If instead there 
exists $x$ in $\mathcal X$ such that $P(S_{t+1}, Y_t \mid S_t, x) > 0$, then 



\Zt\ 



log 



E PiSt+uYt\St, x)r%,ix) 



E P(St+i,yt|St,x)Ti,,(x) 



< 2 log \ui{P{St^uYt\St,x)} 



21og-^ <21og^ 



It is easy to check by induction that for every n > 

{S'l+\Y-p\W e Wo) 



log; 



5"+\yi"|VF € Wi) 



(67) 



Indeed, (67) holds true for $n = 0$, since $S_1$ is independent of $W$ (with the convention that an 
empty summation equals zero). Moreover, suppose (67) holds true for some $n$. Then 



log 



P$ {S^+^,Y{'+^\W e Wo) 
P$ {S'l+^ ,Y{'+^W £ Wi) 



log- 
log; 
log 





Wo)P* {Sn+2,Yn+l 


WeWo,£n+i) 


P$ {S'^^\Y{' 

p$ (55'+\yi" 


WeWi)W^ {Sn+2,Yn+l 
We Wo) E PiSn+2,Yn 

xex 


WeWi,£n+i) 

+1 \Sn+l,x)T^^n+l 


p$ (5i"+\yi" 

P.I, {S'^+\Y-P 

p$ (5i"+\yi" 


WeWi)Z P{Sn+2,Yn+l\Sn+l,x)rl^^^ 

xex 

'^''^i^z.^.jfzt. 



(x) 



Now, by applying the log-sum inequality and recalling the definition (9) of the cost $d$, we 
have, for every $t \ge 1$, 



E^[Zt\W eWo,£t, 



Ed 



log 



E P{St+i,Yt\St,x)r%,{x) 

xex 



E P{St+i,Yt\St,x)rl,{x) 



W G Wo,£t 



E P{s,y\St,x)rl,{x) 

x&X 



y&y 



P{s,y\St,x)r%,{x) 
P{s,y\St,x)rl^{x) 



(68) 
From (67) and (68) it follows that, if we define 

(5|^+\yf jiy G Wi) 



V„. := loe 



t=l 



then iVn,Qn)n>o ^ submartingale with respect to the conditioned probability measure 
P<i>( • I G Wo)- Moreover it follows from ()66p (recall that we are assuming A > and that 
this is equivalent to the boundedness of K) that (Vn) has uniformly bounded increments: 

IK+I - Vn\ < \Zn+l\ + \d{St.,'^%^ri+l)\ < ^Og ^ + C?max < +00. 

Thus, since $\tau \le T$, Doob's optional stopping theorem can be applied, yielding 

\[ \mathbb E_\Phi\left[V_T - V_\tau \mid W \in \mathcal W_0, \mathcal G_\tau\right] \le 0 . \tag{69} \]

Then the claim follows from (69), after noticing that 



VT-Vr= log 



;T+1 \rT 
'r+2 ' 't+1 



\ (5j+2,n';i|VFGWi,^0 



— a.s. 



t=T+l 



Lemma 5 Let $\Phi$ be any causal encoder, and $\tau$ and $T$ be stopping times for the filtration $\mathcal G$ 
such that $\tau \le T$. Then, for every $2^{\mathcal W}$-valued $\mathcal G_\tau$-measurable r.v. $\mathcal W_1$, we have $\mathbb P_\Phi$-a.s. 



Ed 



t=T + l 



>log_-iogP(^'/ lw,(W^)|g. 



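where 

\[ Z := \min\left\{ \mathbb P_\Phi(W \in \mathcal W_0 \mid \mathcal G_\tau) , \ \mathbb P_\Phi(W \in \mathcal W_1 \mid \mathcal G_\tau) \right\} . \]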



Proof First we will prove the statement when Wi is a fixed, non-trivial subset of the message 
set W. From the log-sum inequality it follows that 



Ln 



log' 



WeWo,Gr 



^ = i\W eWi,Gr] log 



U =i| G Wi,Gr 



> Elf 

> -H (^P$(§ = l\W e Wo,Gt)) -P$ = 0| G Wo,^rj logPcji = 1| G Wo,ar 

> -log2 -P$ (^^ = 0\W € Wo,Gr^ logP$ (^^ = l\W e Wo,Gr 

We now consider the error probability of ^ conditioned on the sigma- field Gt- 

F^(^^ ^lw,iW)\Gr) = nw eWo\Gr)F^{^ = l\W eWo,Gr) 

+F^{W G Wi|g^)P$(^ = 0| ly G WuGt) 

> min{P$(VF G Wi|^^)}P$(§ = G Wo,e?T 
1=0,1 

= Z¥^{^ = l\W eWo,Gr)- 
From Lemma 4 it follows that 



P 

log 



.t=T + l 



> - log 2 - 
> - log 2 - P$ ( = 01 G Wo, ) log 



w G Wo,gr 



^(^> = 0\W e Wo,gr) logP$ = l\W e Wo,gr 

1 



z 



^y^lWr{W)\g, 



An analogous derivation leads to 

T 



^ d{St,rl^t)\w ewi,gr 



.t=T+l 



(70) 



> -iog2-p$ (^^ = i\w ewi,gr) log / im{w)\ gr)^ . 



(71) 

If we now average (70) and (71) with respect to the posterior distribution of $W$ given $\mathcal G_\tau$, we 
obtain (27). Finally, since the claim holds true for every choice of $\mathcal W_0$ in $2^{\mathcal W} \setminus \{\emptyset, \mathcal W\}$, it 
continues to hold true when $\mathcal W_0$ is a $2^{\mathcal W} \setminus \{\emptyset, \mathcal W\}$-valued $\mathcal G_\tau$-measurable random variable. 



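Lemma 6 Let $\Phi$ be a causal feedback encoder and $T$ a transmission time for $\Phi$. Then, for 
every $0 < \delta < 1/2$ there exists a $\mathcal G_{\tau_\delta}$-measurable random subset $\mathcal W_1$ of the message set $\mathcal W$, 
whose a posteriori error probabilities satisfy 

\[ 1 - \lambda\delta \ge \mathbb P_\Phi\left(W \in \mathcal W_i \mid \mathcal G_{\tau_\delta}\right) \ge \lambda\delta , \qquad i = 0, 1 . \]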

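Proof Suppose first that $P_{MAP}(\tau_\delta) \le \delta$. Then, since clearly $P_{MAP}(\tau_\delta - 1) > \delta$, by Lemma 
2 we have 

\[ P_{MAP}(\tau_\delta) \ge \lambda\, P_{MAP}(\tau_\delta - 1) > \lambda\delta . \]

It follows that if we define $\mathcal W_1 := \{\Psi_{MAP}(\tau_\delta)\}$, we have 

\[ \mathbb P_\Phi(W \in \mathcal W_1 \mid \mathcal G_{\tau_\delta}) = 1 - P_{MAP}(\tau_\delta) \ge 1 - \delta > \lambda\delta , \qquad \mathbb P_\Phi(W \notin \mathcal W_1 \mid \mathcal G_{\tau_\delta}) = P_{MAP}(\tau_\delta) > \lambda\delta . \]

If instead $P_{MAP}(\tau_\delta) > \delta$, then the a posteriori probability of any message in $\mathcal W$ at time $\tau_\delta$ 
satisfies $\mathbb P_\Phi(W = w \mid \mathcal G_{\tau_\delta}) < 1 - \delta$. Then it is possible to construct $\mathcal W_1$ in the following 
way. Introduce an arbitrary labelling of $\mathcal W = \{w_1, w_2, \ldots, w_{|\mathcal W|}\}$. For any $1 \le i \le |\mathcal W|$, 
define $\mathcal W^{(i)} = \{w_1, \ldots, w_i\}$. Set $k := \inf\{1 \le i \le |\mathcal W| : \mathbb P_\Phi(W \in \mathcal W^{(i)} \mid \mathcal G_{\tau_\delta}) \ge \lambda\delta\}$, and define 
$\mathcal W_1 = \mathcal W^{(k)}$. Then clearly $\mathbb P_\Phi(W \in \mathcal W_1 \mid \mathcal G_{\tau_\delta}) \ge \lambda\delta$, while 

\begin{align*}
\mathbb P_\Phi(W \notin \mathcal W_1 \mid \mathcal G_{\tau_\delta}) &= 1 - \mathbb P_\Phi(W \in \mathcal W^{(k)} \mid \mathcal G_{\tau_\delta}) \\
&= 1 - \mathbb P_\Phi(W \in \mathcal W^{(k-1)} \mid \mathcal G_{\tau_\delta}) - \mathbb P_\Phi(W = w_k \mid \mathcal G_{\tau_\delta}) \\
&> 1 - \lambda\delta - (1 - \delta) \ge \lambda\delta .
\end{align*}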
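The construction in the second case of the proof is a simple greedy procedure: messages are accumulated in an arbitrary order until their total posterior mass first reaches $\lambda\delta$. The sketch below illustrates it under the assumptions that no single posterior exceeds $1 - \delta$ and that $\lambda \le 1/2$, both checked explicitly; the function name and the numerical values are hypothetical.

```python
def split_messages(posterior, delta, lam):
    """Return the indices of a prefix W_1 with lam*delta <= P(W_1) <= 1 - lam*delta.

    Assumes max(posterior) <= 1 - delta (second case of the proof) and
    0 < lam <= 1/2; both preconditions are checked explicitly.
    """
    if max(posterior) > 1 - delta or not 0 < lam <= 0.5:
        raise ValueError("preconditions of the construction are not met")
    mass, subset = 0.0, []
    for index, prob in enumerate(posterior):
        subset.append(index)
        mass += prob
        if mass >= lam * delta:  # first prefix reaching the threshold
            return subset, mass
    raise ValueError("posterior mass never reaches lam * delta")


posterior = [0.02, 0.03, 0.55, 0.40]  # illustrative values only
subset, mass = split_messages(posterior, delta=0.4, lam=0.25)
print(subset, mass)
```

Because the prefix before the last added message has mass below $\lambda\delta$ and the last message has mass at most $1 - \delta$, the complement keeps mass at least $\delta(1 - \lambda) \ge \lambda\delta$.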
B Proofs for Section 4 

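Lemma 8 For every $\varepsilon > 0$, and for every feasible policy $\pi$, 

\[ \mathbb P_\pi\left( \|F(\nu_n)\| \ge \varepsilon + \frac1n \right) \le 2|\mathcal S| \exp\left(-n\varepsilon^2/2\right) . \]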



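Proof Let us fix an arbitrary admissible policy $\pi$ in $\Pi$. For every $s$ in $\mathcal S$ consider the 
following random process: 

\[ Z^s_0 := Z^s_1 := 0 , \qquad Z^s_n := (n-1) F_s(\nu_{n-1}) + 1_{\{S_n = s\}} - 1_{\{S_1 = s\}} , \quad n \ge 2 . \]

We have 

\begin{align*}
Z^s_n &= (n-1) F_s(\nu_{n-1}) + 1_{\{S_n = s\}} - 1_{\{S_1 = s\}} \\
&= (n-1)\, \nu_{n-1}(\{s\}, \mathcal U) + 1_{\{S_n = s\}} - 1_{\{S_1 = s\}} - (n-1) \int_{\mathcal S \times \mathcal U} Q(s \mid j, u)\, d\nu_{n-1}(j, u) \\
&= \sum_{t=2}^{n} 1_{\{S_t = s\}} - \sum_{t=2}^{n} Q(s \mid S_{t-1}, U_{t-1}) \\
&= \sum_{t=2}^{n} \left( 1_{\{S_t = s\}} - \mathbb E_\pi\left[ 1_{\{S_t = s\}} \mid \mathcal F_{t-1} \right] \right) .
\end{align*}

It is immediate to check that $Z^s_n$ is $\mathcal F_n$-measurable. Moreover 

\[ \mathbb E_\pi\left[ Z^s_{n+1} \mid \mathcal F_n \right] = Z^s_n , \qquad \forall\, n \ge 0 , \]

so that $(Z^s_n, \mathcal F_n, \mathbb P_\pi)_{n \ge 0}$ is a martingale. Moreover, $(Z^s_n)$ has uniformly bounded increments 
since $|Z^s_1 - Z^s_0| = a_1 := 0$, while 

\[ |Z^s_{n+1} - Z^s_n| = \left| 1_{\{S_{n+1} = s\}} - \mathbb E_\pi\left[ 1_{\{S_{n+1} = s\}} \mid \mathcal F_n \right] \right| \le a_{n+1} := 1 , \qquad n \ge 1 . \]

It follows that we can apply the Hoeffding-Azuma inequality [19], obtaining 

\[ \mathbb P_\pi\left( |Z^s_{n+1}| \ge \varepsilon n \right) \le 2 \exp\left( -\frac{\varepsilon^2 n^2}{2 \sum_{k=1}^{n+1} a_k^2} \right) = 2 \exp\left( -\frac{\varepsilon^2}{2} n \right) . \]

By simply applying a union bound, we can conclude that 

\begin{align*}
\mathbb P_\pi\left( \|F(\nu_n)\| \ge \varepsilon + \frac1n \right) &= \mathbb P_\pi\left( \max_{s\in\mathcal S} \left| Z^s_{n+1} + 1_{\{S_1 = s\}} - 1_{\{S_{n+1} = s\}} \right| \ge \varepsilon n + 1 \right) \\
&\le \sum_{s\in\mathcal S} \mathbb P_\pi\left( |Z^s_{n+1}| \ge \varepsilon n \right) \le 2|\mathcal S| \exp\left(-n\varepsilon^2/2\right) .
\end{align*}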
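The Hoeffding-Azuma step can be illustrated by simulation. The sketch below assumes a toy uncontrolled two-state chain, which is a simplification of the controlled setting of the lemma; it builds the centered occupation count of one state and compares the empirical frequency of large deviations with the bound $2\exp(-n\varepsilon^2/2)$. The function name, the transition probabilities and the seed are hypothetical.

```python
import math
import random


def martingale_z(n, q_stay, target, rng):
    """Z_{n+1} = sum_{t=2}^{n+1} (1{S_t = target} - Q(target | S_{t-1})).

    Two-state chain on {0, 1} started in state 0 that keeps its current
    state s with probability q_stay[s] (toy chain, illustration only).
    """
    state, z = 0, 0.0
    for _ in range(n):
        # Conditional probability that the next state equals the target.
        p_target = q_stay[state] if state == target else 1 - q_stay[state]
        state = state if rng.random() < q_stay[state] else 1 - state
        z += (1.0 if state == target else 0.0) - p_target
    return z


rng = random.Random(1)
n, eps, trials = 200, 0.2, 2000
q_stay = {0: 0.3, 1: 0.6}

hits = sum(abs(martingale_z(n, q_stay, 0, rng)) >= eps * n for _ in range(trials))
empirical = hits / trials
azuma_bound = 2 * math.exp(-n * eps ** 2 / 2)
print(empirical, azuma_bound)
```

The bound is typically far from tight for such a chain, so the empirical frequency is expected to sit well below it.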



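Lemma 9 The map $\gamma$ is upper semicontinuous (i.e. $x_n \to x \Rightarrow \limsup_n \gamma(x_n) \le \gamma(x)$). 
Proof Possibly up to a subsequence, with no loss of generality we can assume that 

\[ \gamma(x_n) \to \limsup_n \gamma(x_n) . \]

Since $\mathcal S \times \mathcal U$ is compact, the Prohorov space $\mathcal P(\mathcal S \times \mathcal U)$ is compact as well [3]. Thus, since the 
map $\eta \mapsto \|F(\eta)\|$ is continuous, the sublevel $\{\|F(\eta)\| \le x\}$ is compact. It follows that for 
every $n$ there exists $\eta_n$ in $\mathcal P(\mathcal S \times \mathcal U)$ such that 

\[ \gamma(x_n) = \sup\left\{ \langle \eta, g \rangle \mid \eta \in \mathcal P(\mathcal S \times \mathcal U) : \|F(\eta)\| \le x_n \right\} = \langle \eta_n, g \rangle , \qquad \|F(\eta_n)\| \le x_n . \]

Since $\mathcal P(\mathcal S \times \mathcal U)$ is compact we can extract a converging subsequence $(\eta_{n_k})$; define $\eta := 
\lim_k \eta_{n_k}$. Clearly 

\[ \|F(\eta)\| = \lim_k \|F(\eta_{n_k})\| \le x . \]

It follows that 

\[ \gamma(x) = \sup\left\{ \langle \eta, g \rangle \mid \eta \in \mathcal P(\mathcal S \times \mathcal U) : \|F(\eta)\| \le x \right\} \ge \langle \eta, g \rangle = \lim_k \langle \eta_{n_k}, g \rangle = \limsup_n \gamma(x_n) . \]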



Lemma 10 Let $(\tau_k)$ be a sequence of stopping times for the filtration $\mathcal F$ and $(\pi^k)$ be a 
sequence of feasible policies such that $\mathbb E_{\pi^k}[\tau_k] < \infty$ for every $k$ and holds true. Then 



limP . >7(e) =0, Ve > . 
ken ^ \ V 

Proof For every m in Z4. such that P^fe (r^ > m) > we have 



P,. (Gf. > .(.)l n > m) = E (G?. > 7(.)l = 

< E {rt = i) (Gj.. > 7(e)l n = i) 

= P,. (G^^ > 7(e)) . 
An application of the Bayes rule thus gives us 

P^^ (Tfe > m) > P^fc (rfc > m\ G^^ > 7(e)) , V k s.t. P^. (g^^ > 7(e)) > , 

which in turn implies 

> E (Tfc > m\ G^^ > 7(e)) = [Tfc| G^^ > 7(e)] • ^ ' 

m>0 

On the other hand, for every $\varepsilon > 0$, using a union bound and (45) we get 

P^.(G:;>7(e + i)) = P^. (U>„{(^*,c)>7(e + ^)}) 

< El^.^((^*,c)>7(e + i)) 

< 2|5| E exp (-te2/2) (73) 
exp (-ne^/2) 



2151 



1 - exp (-£2/2) 
It follows that for every $M$ in $\mathbb N$ we have 

(G^^ > 7(e + 7^)) = P^. ({G^, > 7(£ + i?)} n {r^ > M}) + P^. ({G^^ > ^(^ + ^)} n {r, < M}) 

< E IP^^ {{Gi > 7(e + 7^)} n {Tfe = t}) + P^. in < M) 

t>M 

< E (Gi' > 7(e + ii)) + (Tfc < M) 

t>M l-exp(-eV2) 

= 7^ 2/0^^2 (-M6V2) + P.. (r, < M) , 

(1 - exp(-e72)) 

so that it follows from (46) 



2|>S| 



hm sup P„. (G^^ > ^(e + ^)) < ' ' ^ (-^^V2) + hm sup P^. {n < M) 



< 



2\S\ 



(1 -exp(-e2/4|5|2)f 



exp(-MeV2) , 



and by the arbitrariness of M in N we get the claim. 



Lemma 12 In the previous setting, for every fixed $M$ in $\mathbb N$, we have 

\[ \lim_{k\in\mathbb N} \mathbb P_{\Phi_k}(\tau_k \le M) = 0 , \qquad \lim_{k\in\mathbb N} \mathbb P_{\Phi_k}(T_k - \tau_k \le M) = 0 . \]

Proof From Lemma 2 we have 
This implies that, for every M in N, 



PfiApiTkWk -Tk<M P^fe {Tk -Tk<M) 



> A,5fcA^^P$, (Tfc -Tk<M) . 

It follows that 

F$fc (Tfc - Tfc < M) < A-^^-^ ^^ ' ^ . 

In order to show the first part of the claim, suppose that PMApi'''^) ^ ^k- Then 

I.,.. I A < Pmap in) < Ok ■ 

\yvk\ 

It follows that we have 

P$. {Tk <M) < P^fe ({Tfe < M} n [pf^Api^k) <6k])+ P$. (^'MAp(rfc) > 5k 
< P^, (^^A^^<4)+P^.(rfc = r,)'=^0. 